Hi @sandy_123
The "Multiple failures in stage materialization" error at line 120 is caused by a massive shuffle bottleneck
Check the spark UI and try to understand the reason for the failure such as RPC, heartbeat error etc
- Window function computed multiple times (lines 26-117)
  - The window function that computes geo_routing_flag with `partition by concat(...)` forces Spark to shuffle all of the data by the concatenated visitor key.
  - This happens separately for each data source, so you pay for several expensive shuffles instead of one.
  - If visitor IDs are skewed (some visitors have thousands of records), the corresponding partitions become extremely large; see the sketch after this list.
- Memory pressure from the chain of shuffles
  - By the time the job reaches the final write at line 120, the accumulated shuffle operations have built up significant memory pressure on the executors.
  - Straggler tasks (partitions processing 10-100x more data than the rest) make the job appear "stuck".
- Write operation triggering a final shuffle
  - The write with `.mode("overwrite")` is the final straw that triggers one more large shuffle; the sketch below also repartitions before the write to soften this.
  - If dynamic partitioning is used, this multiplies the cost further.
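Here is a minimal sketch of how the window computation and the final write could be restructured. The table name and columns (visits, site_id, visitor_id, is_geo_routed, event_date) and the aggregate used for geo_routing_flag are placeholders, since I can't see your actual code; adapt them to your job:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("geo_routing_flag_sketch").getOrCreate()

# Let AQE coalesce small shuffle partitions (Spark 3.x); it will not split a
# skewed window partition by itself, but it trims overall shuffle overhead.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
# Only rewrite the partitions actually touched by the overwrite.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

# Placeholder source table and columns.
df = spark.table("visits")

# Materialize the partition key once instead of repeating concat(...) inside
# every window definition, so all branches shuffle on the same key.
df = df.withColumn("visitor_key", F.concat_ws("|", "site_id", "visitor_id"))

w = Window.partitionBy("visitor_key")

# Compute the window-derived flag a single time and cache the result so it is
# not re-shuffled for every downstream data source/branch.
flagged = df.withColumn("geo_routing_flag", F.max("is_geo_routed").over(w)).cache()
flagged.count()  # force materialization of the cache

# Repartition on the write-partition column before the overwrite to avoid one
# more skewed shuffle and a flood of small files.
(flagged
    .repartition("event_date")
    .write
    .mode("overwrite")
    .partitionBy("event_date")
    .saveAsTable("visits_with_geo_routing_flag"))

flagged.unpersist()
```

If one visitor key still dominates, salting the key (adding a small suffix and aggregating in two passes) is the usual next step, but that has to be matched to your window logic.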
Beyond that, try caching the frequently reused DataFrame and see whether tuning the executor sizing (memory, cores, memory overhead) or increasing the number of executors resolves the failures.
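As a rough sketch of the knobs worth experimenting with; the numbers below are placeholders and should be sized against what the Spark UI shows for spill, GC time, and task skew:

```python
from pyspark.sql import SparkSession

# Illustrative executor sizing; these must be set before the session (and its
# executors) are created, e.g. via spark-submit --conf or the cluster config.
spark = (
    SparkSession.builder
    .appName("tuned_job")
    .config("spark.executor.memory", "16g")
    .config("spark.executor.cores", "4")
    .config("spark.executor.memoryOverhead", "4g")
    .config("spark.sql.shuffle.partitions", "800")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.maxExecutors", "50")
    .getOrCreate()
)

# Cache only the DataFrame that is reused by multiple branches, and release it
# once the downstream writes are done.
base_df = spark.table("visits").cache()
base_df.count()   # materialize the cache
# ... reuse base_df in the downstream transformations ...
base_df.unpersist()
```

If the stage view shows tasks spilling to disk, raise executor memory/overhead first; if only a handful of tasks run far longer than the rest, that points back to key skew rather than executor sizing.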