cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Range join hint does not help in faster execution of spark sql

ashokv
New Contributor II
Spark SQL execution did not complete even after 12 hours, i ran it on i3.xlarge with 4 worker nodes.
only two worker nodes showed as running, with CPU at 100%
what should i do differently?
 
--SQL
INSERT into  attribute_results
...
SELECT  /*+ BROADCAST(t) RANGE_JOIN(ta, 46) */
  ..,
      ..,
      ..     
from 
 (select * from transactions where txn_date ='2025-05-01'  ) t
INNER JOIN transaction_attributes ta
ON t.txn_date BETWEEN ta.analysis_start_date AND ta.analysis_end_date;
 
select count(1) from transactions;
2,782,521
 
select count(1) from transactions where txn_date ='2025-05-01' ; 
92,387
 
select count(1) from transaction_attributes ta 
where to_date('2025-05-01') BETWEEN ta.analysis_start_date AND ta.analysis_end_date
43,589,999
2 REPLIES 2

saiprasadambati
New Contributor III

can you share the result of the below query ?

select count(1) from transaction_attributes where analysis_start_date = '2025-05-01' and analysis_end_date = '2025-05-01' , 

 

If it has multiple entries , the join condition will lead to cross join and hence the spark job is running for ever 

Yes, it has multiple entries and i expect it that way. cross join is also expected. 

select count(1) from transaction_attributes where analysis_start_date = '2025-05-01' and analysis_end_date = '2025-05-01'

338489