How can Spark UI metrics be used to distinguish data skew from insufficient cluster resources?
When a Databricks job is slow, we usually look at Spark UI metrics such as task duration, shuffle read/write size, spilled bytes, GC time, executor CPU utilization, and the spread of per-task input sizes within a stage.
However, some symptoms can overlap. For example, a long-running stage with high spill and a few slow tasks could be caused by data skew, insufficient executor memory, too few partitions, or an inefficient join strategy.
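To make the overlap concrete, here is a rough sketch of the kind of check I have in mind. It assumes the per-task durations and spill sizes for one stage have been copied out of the stage detail page by hand. The function name `classify_stage` and both thresholds (a 5x max-to-median duration ratio, and half of the tasks spilling) are placeholders I made up for illustration, not established rules.

```python
from statistics import median


def classify_stage(task_durations_s, task_spill_bytes,
                   skew_ratio_threshold=5.0,      # assumed cutoff, not a standard
                   spill_fraction_threshold=0.5):  # assumed cutoff, not a standard
    """Heuristic guess at whether a stage looks skewed or memory-bound.

    task_durations_s: per-task durations in seconds for one stage.
    task_spill_bytes: per-task spilled bytes, in the same task order.
    """
    if not task_durations_s or len(task_durations_s) != len(task_spill_bytes):
        raise ValueError("need one duration and one spill value per task")

    med = median(task_durations_s)
    # How much longer the slowest task ran than a typical task.
    skew_ratio = max(task_durations_s) / med if med > 0 else float("inf")

    # Share of tasks that spilled at all.
    spilling = sum(1 for b in task_spill_bytes if b > 0)
    spill_fraction = spilling / len(task_spill_bytes)

    if skew_ratio >= skew_ratio_threshold and spill_fraction < spill_fraction_threshold:
        # A few tasks are far slower and spill is confined to a minority of tasks.
        verdict = "likely data skew"
    elif skew_ratio < skew_ratio_threshold and spill_fraction >= spill_fraction_threshold:
        # Tasks are similar in duration but most of them spill.
        verdict = "likely memory pressure or too few partitions"
    else:
        verdict = "ambiguous"

    return {"skew_ratio": skew_ratio,
            "spill_fraction": spill_fraction,
            "verdict": verdict}


# Hypothetical numbers, not taken from a real job.
skewed = classify_stage([10, 11, 9, 10, 120], [0, 0, 0, 0, 5_000_000])
uniform = classify_stage([60, 62, 58, 61, 65], [1_000_000] * 5)
print(skewed["verdict"])
print(uniform["verdict"])
```

The "ambiguous" branch is the case I am most unsure how to handle, for example when the slowest tasks are also the only ones spilling heavily but many other tasks spill a little.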
What is a reliable investigation sequence in Spark UI to identify the primary bottleneck?
In particular:
Which Spark UI metrics most strongly indicate data skew versus memory pressure?
How do you determine whether repartitioning, salting, broadcast joins, increasing executor memory, or enabling AQE is the right first action?
Are there practical thresholds or patterns that experienced teams use before changing cluster configuration?
How do you validate that the optimization fixed the root cause rather than only improving one run?
I’m looking for a repeatable troubleshooting approach rather than a one-off tuning recommendation.