yesterday
Hi @Dhivyadharshini,
Your question prompted me to write a blog post about it, so thank you for asking.
Here is the sequence I follow:
- Stages tab, sort by Duration descending. Pick the longest stage and click into it. Everything else is noise until you understand that one stage.
- Get three numbers from Task Metrics: Median task duration, Max task duration, and Median vs Max shuffle read size per task.
- Ask three questions in order:
  - Is Max Duration more than 5x Median, and is shuffle read also skewed? That is data skew. Start with a broadcast join if the smaller side fits in memory; otherwise, use salting.
  - Does spill appear on most tasks, or is GC time above 10% in the Executors tab? That is memory pressure. Increase the shuffle partitions before requesting more executor memory.
  - Is the task count below 2x your executor core count? That is underparallelism. Raise spark.sql.shuffle.partitions or add an explicit repartition().
- If none of those fit, open the SQL/DataFrame tab and check the physical plan for cross joins, missing predicate pushdown, or a sort-merge join where broadcast would work.
- Validate the fix properly: confirm the underlying metric moved (GC time down to near zero, Max/Median ratio below 2x), not just wall-clock time. Run on a cold cache and at full production data volume.
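
If it helps, the three questions above can be written down as a small triage function. This is only a sketch in plain Python, not a Spark or Databricks API: `StageMetrics`, `triage`, and all field names are hypothetical, and you would fill in the numbers by hand from the Spark UI. Two thresholds are my assumptions rather than rules: shuffle read counts as "skewed" at the same 5x ratio used for duration, and "most tasks" means more than half. `executor_cores` is assumed to be the total across all executors.

```python
from dataclasses import dataclass


@dataclass
class StageMetrics:
    """Numbers copied by hand from the Spark UI for one stage (hypothetical container)."""
    median_task_s: float
    max_task_s: float
    median_shuffle_read_mb: float
    max_shuffle_read_mb: float
    spill_task_fraction: float  # share of tasks showing spill, 0.0 to 1.0
    gc_time_fraction: float     # GC time divided by task time, from the Executors tab
    task_count: int
    executor_cores: int         # assumed: total cores across all executors


def ratio(max_value: float, median_value: float) -> float:
    """Max/median ratio that tolerates a zero median."""
    if median_value <= 0:
        return float("inf") if max_value > 0 else 1.0
    return max_value / median_value


def triage(m: StageMetrics) -> str:
    """Apply the three questions in order and return the first diagnosis that fits."""
    duration_skewed = ratio(m.max_task_s, m.median_task_s) > 5
    # Assumption: reuse the 5x threshold to decide that shuffle read is "also skewed".
    shuffle_skewed = ratio(m.max_shuffle_read_mb, m.median_shuffle_read_mb) > 5
    if duration_skewed and shuffle_skewed:
        return "data skew"

    # Assumption: "most tasks" is interpreted as more than half of them.
    if m.spill_task_fraction > 0.5 or m.gc_time_fraction > 0.10:
        return "memory pressure"

    if m.task_count < 2 * m.executor_cores:
        return "underparallelism"

    # None of the three fit: go read the physical plan in the SQL/DataFrame tab.
    return "inspect physical plan"


if __name__ == "__main__":
    stage = StageMetrics(
        median_task_s=10, max_task_s=120,
        median_shuffle_read_mb=50, max_shuffle_read_mb=900,
        spill_task_fraction=0.0, gc_time_fraction=0.02,
        task_count=400, executor_cores=64,
    )
    print(triage(stage))  # prints "data skew"
```

The order of the checks matters: a skewed stage often spills on its largest tasks too, so checking skew first avoids treating a skew problem as a memory problem.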
Check the blog and let me know if you have any questions. Happy to dig into any specific stage metrics.
If this answer resolves your question, could you mark it as “Accept as Solution”? That helps other users quickly find the correct fix.
Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***