Databricks Community

Dhivyadharshini · 3 weeks ago

How can Spark UI metrics be used to distinguish data skew from insufficient cluster resources?

When a Databricks job is slow, we usually look at Spark UI metrics such as task duration, shuffle read/write, spilled bytes, GC time, executor CPU utilization, and skewed task sizes.

However, some symptoms can overlap. For example, a long-running stage with high spill and a few slow tasks could be caused by data skew, insufficient executor memory, too few partitions, or an inefficient join strategy.

What is a reliable investigation sequence in Spark UI to identify the primary bottleneck?

In particular:

- Which Spark UI metrics most strongly indicate data skew versus memory pressure?
- How do you determine whether repartitioning, salting, broadcast joins, increasing executor memory, or enabling AQE is the right first action?
- Are there practical thresholds or patterns that experienced teams use before changing cluster configuration?
- How do you validate that the optimization fixed the root cause rather than only improving one run?

I’m looking for a repeatable troubleshooting approach rather than a one-off tuning recommendation.

Ashwin_DSA · 3 weeks ago

Hi @Dhivyadharshini,

Your question prompted me to write a blog post about it, so thank you for asking.

Here is the sequence I follow:

1. In the Stages tab, sort by Duration descending. Pick the longest stage and click into it. Everything else is noise until you understand that one stage.
2. From the stage's Summary Metrics table, note the median and max task duration and the median and max shuffle read size per task.
3. Ask three questions in order:
   - Is max duration more than 5x the median, and is shuffle read also skewed? That is data skew. Start with a broadcast join if the smaller side fits in memory; otherwise, use salting.
   - Does spill appear on most tasks, or is GC time above 10% of task time in the Executors tab? That is memory pressure. Increase the number of shuffle partitions before requesting more executor memory.
   - Is the task count below 2x your total executor core count? That is underparallelism. Raise spark.sql.shuffle.partitions or add an explicit repartition().
4. If none of those fit, open the SQL/DataFrame tab and check the physical plan for cross joins, missing predicate pushdown, or a sort-merge join where a broadcast join would work.
5. Validate the fix properly: confirm that the underlying metric moved (GC time back under 10%, Max/Median ratio below 2x), not just wall-clock time. Run on a cold cache and at full production data volume.
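
The three questions above can be written down as a small decision function, which makes the ordering explicit. This is only a sketch in plain Python, not a Databricks or Spark API: `StageMetrics`, its field names, and `triage` are hypothetical, and the numbers are assumed to be copied by hand from the Spark UI. Two thresholds are assumptions on top of the thread: "shuffle read also skewed" is taken to mean more than 5x the median, and "most tasks" is taken to mean more than half.

```python
from dataclasses import dataclass


@dataclass
class StageMetrics:
    """Numbers copied by hand from the Spark UI for one stage (hypothetical field names)."""
    median_duration_s: float
    max_duration_s: float
    median_shuffle_read_mb: float
    max_shuffle_read_mb: float
    tasks_with_spill: int
    task_count: int
    gc_time_fraction: float  # GC time divided by task time, from the Executors tab
    total_executor_cores: int


def max_to_median(maximum: float, median: float) -> float:
    """Max/median ratio; treats an all-zero metric as perfectly balanced."""
    if median <= 0:
        return float("inf") if maximum > 0 else 1.0
    return maximum / median


def triage(m: StageMetrics) -> str:
    """Apply the three questions in order and return the first match."""
    duration_ratio = max_to_median(m.max_duration_s, m.median_duration_s)
    shuffle_ratio = max_to_median(m.max_shuffle_read_mb, m.median_shuffle_read_mb)

    # 1. Long tail in task duration AND in shuffle read size.
    #    Reusing 5x for shuffle read is an assumption; the thread states it for duration only.
    if duration_ratio > 5 and shuffle_ratio > 5:
        return "data skew"

    # 2. Spill on most tasks (assumed: more than half) or GC above 10% of task time.
    if m.tasks_with_spill > m.task_count / 2 or m.gc_time_fraction > 0.10:
        return "memory pressure"

    # 3. Fewer than two tasks per available core.
    if m.task_count < 2 * m.total_executor_cores:
        return "underparallelism"

    # 4. Nothing matched: the physical plan is the next place to look.
    return "inspect physical plan"


skewed = StageMetrics(10, 120, 50, 900, 1, 200, 0.03, 16)
print(triage(skewed))  # prints: data skew
```

Because the checks run in order, a stage that is both skewed and spilling is reported as data skew first, which matches the sequence above: fix the skew, then re-measure before touching memory.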

Check the blog and let me know if you have any questions. Happy to dig into any specific stage metrics.

If this answer resolves your question, could you mark it as “Accept as Solution”? That helps other users quickly find the correct fix.

Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***

Vibiksha · 3 weeks ago

A simple way to troubleshoot a slow Spark job using Spark UI is:

1. Check task duration
   - A few very slow tasks → likely data skew.
   - Most tasks are slow → likely a cluster resource or execution issue.
2. Check shuffle and memory metrics
   - Large differences in shuffle read size between tasks → data skew.
   - High spill, high GC time, or OOM errors → memory pressure.
3. Choose the right fix
   - Data skew → repartitioning, salting, or enabling AQE.
   - Small lookup table → a broadcast join.
   - Memory issues → more executor memory, or more partitions so each task processes less data.
4. Validate the result
   - Compare the Spark UI metrics before and after the change.
   - Confirm lower task time, less spill, lower GC time, and balanced task sizes.
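
The validation step can be made mechanical so that it is applied to several runs rather than one. The sketch below is plain Python with hypothetical names (`RunMetrics`, `failed_checks`, `fix_holds`); the values are assumed to be read from the Spark UI for the same stage before and after the change. The 2x limit for "balanced task sizes" is an assumption borrowed from the Max/Median ratio mentioned earlier in the thread, and the sample numbers are made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunMetrics:
    """Stage-level numbers for one run, copied from the Spark UI (hypothetical field names)."""
    max_task_s: float
    spill_mb: float
    gc_time_fraction: float
    median_shuffle_read_mb: float
    max_shuffle_read_mb: float


def failed_checks(before: RunMetrics, after: RunMetrics,
                  balance_limit: float = 2.0) -> List[str]:
    """Return the checks that the 'after' run fails relative to the baseline."""
    failures = []
    if after.max_task_s >= before.max_task_s:
        failures.append("slowest task did not get faster")
    if after.spill_mb > before.spill_mb:
        failures.append("spill increased")
    if after.gc_time_fraction > before.gc_time_fraction:
        failures.append("GC share increased")
    # Balanced task sizes: max shuffle read within balance_limit times the median.
    if after.max_shuffle_read_mb > balance_limit * after.median_shuffle_read_mb:
        failures.append("shuffle read still unbalanced")
    return failures


def fix_holds(before: RunMetrics, after_runs: List[RunMetrics]) -> bool:
    """True only if there is at least one post-change run and every run passes every check."""
    return bool(after_runs) and all(not failed_checks(before, run) for run in after_runs)


baseline = RunMetrics(max_task_s=120, spill_mb=4096, gc_time_fraction=0.14,
                      median_shuffle_read_mb=50, max_shuffle_read_mb=900)
after_runs = [
    RunMetrics(22, 0, 0.04, 55, 80),
    RunMetrics(25, 0, 0.05, 54, 95),
    RunMetrics(95, 512, 0.06, 52, 400),  # illustrative run where a hot key reappears
]
for run in after_runs:
    print(failed_checks(baseline, run))
print(fix_holds(baseline, after_runs))
```

In this made-up data the third run is faster than the baseline on wall-clock terms yet still fails the balance check, which is the case a single before/after comparison would miss.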

Rule of thumb: A few slow tasks usually mean data skew. Slow performance across all tasks usually means memory or resource limitations.
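
When the rule of thumb points to data skew and the small side is too large to broadcast, salting is the fix that tends to need the most explanation. The toy sketch below uses plain Python rather than Spark to show the idea: a hot key is split into several salted keys so a hash partitioner spreads it out, and the other join side is replicated once per salt value so every row still finds its match. `NUM_PARTITIONS`, `SALT_BUCKETS`, the CRC32-based `partition_of`, and the sample data are all assumptions for illustration; Spark's own partitioner uses a different hash.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 8  # hypothetical shuffle partition count
SALT_BUCKETS = 8    # hypothetical number of salt values


def partition_of(key: str) -> int:
    """Stand-in for a hash partitioner (deterministic, unlike the built-in hash())."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def partition_sizes(keys):
    """Number of rows that land in each partition."""
    counts = Counter(partition_of(k) for k in keys)
    return [counts.get(p, 0) for p in range(NUM_PARTITIONS)]


# Skewed large side: one hot key carries 900 of 1,000 rows.
fact_keys = ["hot"] * 900 + [f"k{i}" for i in range(100)]

# Salting: append a rotating suffix so the hot key becomes SALT_BUCKETS distinct keys.
salted_fact_keys = [f"{k}#{i % SALT_BUCKETS}" for i, k in enumerate(fact_keys)]

# The other join side is replicated once per salt value so every salted key still matches.
dim_rows = {"hot": "Hot customer", "k0": "Customer 0"}
salted_dim = {f"{k}#{s}": v for k, v in dim_rows.items() for s in range(SALT_BUCKETS)}

plain_sizes = partition_sizes(fact_keys)
salted_sizes = partition_sizes(salted_fact_keys)

# The join result is unchanged: the same rows find a match with and without salt.
plain_matches = sum(1 for k in fact_keys if k in dim_rows)
salted_matches = sum(1 for k in salted_fact_keys if k in salted_dim)

print("largest partition before salting:", max(plain_sizes))
print("largest partition after salting: ", max(salted_sizes))
print("matched rows before/after:", plain_matches, salted_matches)
```

In Spark the same pattern is applied by adding a salt column to the large side, exploding the small side across all salt values, and joining on the key plus the salt. The cost is that the small side grows by a factor of the salt count, which is why broadcast is usually tried first.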

Vibiksha J
