Range join hint does not help in faster execution of spark sql
Options
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
06-13-2025 11:24 AM
Spark SQL execution did not complete even after 12 hours, i ran it on i3.xlarge with 4 worker nodes.
only two worker nodes showed as running, with CPU at 100%
what should i do differently?
--SQL
INSERT into attribute_results
...
SELECT /*+ BROADCAST(t) RANGE_JOIN(ta, 46) */
..,
..,
..
from
(select * from transactions where txn_date ='2025-05-01' ) t
INNER JOIN transaction_attributes ta
ON t.txn_date BETWEEN ta.analysis_start_date AND ta.analysis_end_date;
select count(1) from transactions;
2,782,521
select count(1) from transactions where txn_date ='2025-05-01' ;
92,387
select count(1) from transaction_attributes ta
where to_date('2025-05-01') BETWEEN ta.analysis_start_date AND ta.analysis_end_date
43,589,999
Options
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
06-14-2025 12:18 PM
can you share the result of the below query ?
select count(1) from transaction_attributes where analysis_start_date = '2025-05-01' and analysis_end_date = '2025-05-01' ,
If it has multiple entries , the join condition will lead to cross join and hence the spark job is running for ever
Options
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
06-15-2025 01:45 AM
Yes, it has multiple entries and i expect it that way. cross join is also expected.
select count(1) from transaction_attributes where analysis_start_date = '2025-05-01' and analysis_end_date = '2025-05-01'
338489