Spark UI Troubleshooting: Data Skew vs Cluster Resource Bottlenecks
12 hours ago
How can Spark UI metrics be used to distinguish data skew from insufficient cluster resources?
When a Databricks job is slow, we usually look at Spark UI metrics such as task duration, shuffle read/write, spilled bytes, GC time, executor CPU utilization, and skewed task sizes.
However, some symptoms can overlap. For example, a long-running stage with high spill and a few slow tasks could be caused by data skew, insufficient executor memory, too few partitions, or an inefficient join strategy.
What is a reliable investigation sequence in Spark UI to identify the primary bottleneck?
In particular:
Which Spark UI metrics most strongly indicate data skew versus memory pressure?
How do you determine whether repartitioning, salting, broadcast joins, increasing executor memory, or enabling AQE is the right first action?
Are there practical thresholds or patterns that experienced teams use before changing cluster configuration?
How do you validate that the optimization fixed the root cause rather than only improving one run?
I’m looking for a repeatable troubleshooting approach rather than a one-off tuning recommendation.
Labels: Spark
5 hours ago
Hi @Dhivyadharshini,
Your question prompted me to write a blog post about it, so thank you for asking.
Here is the sequence I follow:
- Stages tab, sort by Duration descending. Pick the longest stage and click into it. Everything else is noise until you understand that one stage.
- Get three numbers from Task Metrics: Median task duration, Max task duration, and Median vs Max shuffle read size per task.
- Ask three questions in order:
  - Is Max Duration more than 5x Median, and is shuffle read also skewed? That is data skew. Start with a broadcast join if the smaller side fits in memory; otherwise, use salting.
  - Does spill appear on most tasks, or is GC time above 10% in the Executors tab? That is memory pressure. Increase the shuffle partitions before requesting more executor memory.
  - Is the task count below 2x your executor core count? That is underparallelism. Raise spark.sql.shuffle.partitions or add an explicit repartition().
- If none of those fit, open the SQL/DataFrame tab and check the physical plan for cross joins, missing predicate pushdown, or a sort-merge join where broadcast would work.
- Validate the fix properly: confirm the underlying metric moved (GC time close to zero, Max/Median ratio below 2x), not just wall-clock time. Run on a cold cache and at full production data volume.
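If it helps to make the sequence repeatable, here is a minimal sketch of the three questions as plain Python. It does not call any Spark API: you copy the numbers from the Spark UI by hand. The `StageMetrics` container, the `triage` and `skew_resolved` helpers, and the field names are all hypothetical. Two cutoffs are my assumptions rather than fixed rules: "shuffle read also skewed" is taken as the same 5x Max/Median ratio, and "spill on most tasks" as more than half of the tasks.

```python
from dataclasses import dataclass


@dataclass
class StageMetrics:
    """Numbers copied by hand from the Spark UI for one stage (hypothetical container)."""
    median_task_s: float           # median task duration, seconds
    max_task_s: float              # max task duration, seconds
    median_shuffle_read_mb: float  # median shuffle read size per task
    max_shuffle_read_mb: float     # max shuffle read size per task
    tasks_with_spill: int          # tasks that show any spill
    task_count: int                # total tasks in the stage
    gc_time_fraction: float        # GC time / task time, from the Executors tab
    executor_cores: int            # total executor cores in the cluster


def ratio(max_value: float, median_value: float) -> float:
    """Max/Median ratio; infinite when the median is zero but the max is not."""
    if median_value > 0:
        return max_value / median_value
    return float("inf") if max_value > 0 else 1.0


def triage(m: StageMetrics) -> str:
    """Ask the three questions in order and return the first one that matches."""
    duration_skewed = ratio(m.max_task_s, m.median_task_s) > 5
    # Assumption: reuse the 5x cutoff to decide that shuffle read is "also skewed".
    shuffle_skewed = ratio(m.max_shuffle_read_mb, m.median_shuffle_read_mb) > 5
    if duration_skewed and shuffle_skewed:
        return "data skew"

    # Assumption: "most tasks" means more than half of them.
    spill_on_most_tasks = m.tasks_with_spill > m.task_count / 2
    if spill_on_most_tasks or m.gc_time_fraction > 0.10:
        return "memory pressure"

    if m.task_count < 2 * m.executor_cores:
        return "underparallelism"

    return "check physical plan"


def skew_resolved(m: StageMetrics) -> bool:
    """Validation for a skew fix: Max/Median task duration ratio below 2x."""
    return ratio(m.max_task_s, m.median_task_s) < 2


# Example: one task runs 12x longer and reads 18x more shuffle data than the median.
example = StageMetrics(
    median_task_s=10, max_task_s=120,
    median_shuffle_read_mb=50, max_shuffle_read_mb=900,
    tasks_with_spill=2, task_count=200,
    gc_time_fraction=0.03, executor_cores=32,
)
print(triage(example))  # prints "data skew"
```

The order of the checks matters: a skewed stage often spills on its few large tasks, so testing for skew first keeps it from being misread as memory pressure. After a fix, re-read the same numbers from a cold-cache run at full data volume and feed them back in.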
Check the blog and let me know if you have any questions. Happy to dig into any specific stage metrics.
If this answer resolves your question, could you mark it as “Accept as Solution”? That helps other users quickly find the correct fix.
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***