Range join hint does not help in faster execution ...

ashokv · ‎06-13-2025

Spark SQL execution did not complete even after 12 hours, i ran it on i3.xlarge with 4 worker nodes.

only two worker nodes showed as running, with CPU at 100%

what should i do differently?

--SQL

INSERT into attribute_results

...

SELECT /*+ BROADCAST(t) RANGE_JOIN(ta, 46) */

..,

..

from

(select * from transactions where txn_date ='2025-05-01' ) t

INNER JOIN transaction_attributes ta

ON t.txn_date BETWEEN ta.analysis_start_date AND ta.analysis_end_date;

select count(1) from transactions;

2,782,521

select count(1) from transactions where txn_date ='2025-05-01' ;

92,387

select count(1) from transaction_attributes ta

where to_date('2025-05-01') BETWEEN ta.analysis_start_date AND ta.analysis_end_date

43,589,999

saiprasadambati · ‎06-14-2025

can you share the result of the below query ?

select count(1) from transaction_attributes where analysis_start_date = '2025-05-01' and analysis_end_date = '2025-05-01' ,

If it has multiple entries , the join condition will lead to cross join and hence the spark job is running for ever

ashokv · ‎06-15-2025

Yes, it has multiple entries and i expect it that way. cross join is also expected.

select count(1) from transaction_attributes where analysis_start_date = '2025-05-01' and analysis_end_date = '2025-05-01'

338489

Range join hint does not help in faster execution of spark sql