<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Spark UI Troubleshooting: Data Skew vs Cluster Resource Bottlenecks in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/spark-ui-troubleshooting-data-skew-vs-cluster-resource/m-p/160517#M54894</link>
    <description>&lt;P&gt;&lt;STRONG&gt;How can Spark UI metrics be used to distinguish data skew from insufficient cluster resources?&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;When a Databricks job is slow, we usually look at Spark UI metrics such as task duration, shuffle read/write, spilled bytes, GC time, executor CPU utilization, and skewed task sizes.&lt;/P&gt;&lt;P&gt;However, some symptoms can overlap. For example, a long-running stage with high spill and a few slow tasks could be caused by data skew, insufficient executor memory, too few partitions, or an inefficient join strategy.&lt;/P&gt;&lt;P&gt;What is a reliable investigation sequence in Spark UI to identify the primary bottleneck?&lt;/P&gt;&lt;P&gt;In particular:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Which Spark UI metrics most strongly indicate data skew versus memory pressure?&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;How do you determine whether repartitioning, salting, broadcast joins, increasing executor memory, or enabling AQE is the right first action?&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Are there practical thresholds or patterns that experienced teams use before changing cluster configuration?&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;How do you validate that the optimization fixed the root cause rather than only improving one run?&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;I’m looking for a repeatable troubleshooting approach rather than a one-off tuning recommendation.&lt;/P&gt;</description>
    <pubDate>Thu, 25 Jun 2026 12:23:29 GMT</pubDate>
    <dc:creator>Dhivyadharshini</dc:creator>
    <dc:date>2026-06-25T12:23:29Z</dc:date>
    <item>
      <title>Spark UI Troubleshooting: Data Skew vs Cluster Resource Bottlenecks</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-ui-troubleshooting-data-skew-vs-cluster-resource/m-p/160517#M54894</link>
      <description>&lt;P&gt;&lt;STRONG&gt;How can Spark UI metrics be used to distinguish data skew from insufficient cluster resources?&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;When a Databricks job is slow, we usually look at Spark UI metrics such as task duration, shuffle read/write, spilled bytes, GC time, executor CPU utilization, and skewed task sizes.&lt;/P&gt;&lt;P&gt;However, some symptoms can overlap. For example, a long-running stage with high spill and a few slow tasks could be caused by data skew, insufficient executor memory, too few partitions, or an inefficient join strategy.&lt;/P&gt;&lt;P&gt;What is a reliable investigation sequence in Spark UI to identify the primary bottleneck?&lt;/P&gt;&lt;P&gt;In particular:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Which Spark UI metrics most strongly indicate data skew versus memory pressure?&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;How do you determine whether repartitioning, salting, broadcast joins, increasing executor memory, or enabling AQE is the right first action?&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Are there practical thresholds or patterns that experienced teams use before changing cluster configuration?&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;How do you validate that the optimization fixed the root cause rather than only improving one run?&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;I’m looking for a repeatable troubleshooting approach rather than a one-off tuning recommendation.&lt;/P&gt;</description>
      <pubDate>Thu, 25 Jun 2026 12:23:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-ui-troubleshooting-data-skew-vs-cluster-resource/m-p/160517#M54894</guid>
      <dc:creator>Dhivyadharshini</dc:creator>
      <dc:date>2026-06-25T12:23:29Z</dc:date>
    </item>
    <item>
      <title>Re: Spark UI Troubleshooting: Data Skew vs Cluster Resource Bottlenecks</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-ui-troubleshooting-data-skew-vs-cluster-resource/m-p/160580#M54907</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/239166"&gt;@Dhivyadharshini&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;Your question prompted me to write a &lt;A href="https://community.databricks.com/t5/data-engineering/reading-spark-ui-a-repeatable-guide-to-finding-performance/td-p/160574" target="_blank"&gt;blog&lt;/A&gt; post about it, so thank you for asking.&lt;/P&gt;
&lt;P&gt;Here is the sequence I follow:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;Stages tab, sort by Duration descending. Pick the longest stage and click into it. Everything else is noise until you understand that one stage.&lt;/LI&gt;
&lt;LI&gt;Read these values from the Summary Metrics table on the stage page: median and max task duration, and median and max shuffle read size per task.&lt;/LI&gt;
&lt;LI&gt;Ask three questions in order:&lt;BR /&gt;- Is max task duration more than 5x the median, and is shuffle read size skewed the same way? That is data skew. Start with a broadcast join if the smaller side fits in memory; otherwise, use salting.&lt;BR /&gt;- Does spill appear on most tasks, or is GC time above 10% of task time in the Executors tab? That is memory pressure. Increase the shuffle partition count before requesting more executor memory, because smaller partitions reduce how much data each task holds at once.&lt;BR /&gt;- Is the task count below 2x your total executor core count? That is underparallelism. Raise spark.sql.shuffle.partitions or add an explicit repartition().&lt;/LI&gt;
&lt;LI&gt;If none of those fit, open the SQL/DataFrame tab and check the physical plan for cross joins, missing predicate pushdown, or a sort-merge join where broadcast would work.&lt;/LI&gt;
&lt;LI&gt;Validate the fix properly: confirm that the underlying metric moved (GC time close to zero, max/median ratio below 2x), not just wall-clock time. Re-run on a cold cache and at full production data volume so the improvement is not an artifact of a single run.&lt;/LI&gt;
&lt;/OL&gt;
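&lt;P&gt;The three questions in step 3 can be sketched as a small triage function. This is only an illustration, not part of Spark or Databricks: the StageMetrics class, its field names, and the triage function are hypothetical, and the values are assumed to be copied by hand from the Spark UI. Two thresholds are assumptions that the steps above do not specify: shuffle read counts as skewed when max is more than 5x median, and "most tasks" means more than half.&lt;/P&gt;

```python
from dataclasses import dataclass


@dataclass
class StageMetrics:
    """Values copied by hand from the Spark UI (hypothetical field names)."""
    median_task_sec: float
    max_task_sec: float
    median_shuffle_read_mb: float
    max_shuffle_read_mb: float
    tasks_with_spill: int
    task_count: int
    gc_time_fraction: float  # GC time divided by task time (Executors tab)
    executor_cores: int      # total cores across all executors


def triage(m: StageMetrics) -> str:
    """Apply the three questions in order and return the first match."""
    # Guard against a zero median so the ratios are always defined.
    duration_ratio = m.max_task_sec / max(m.median_task_sec, 1e-9)
    read_ratio = m.max_shuffle_read_mb / max(m.median_shuffle_read_mb, 1e-9)

    # 1. Skew: max duration over 5x median AND shuffle read also skewed.
    #    (The 5x cut-off for shuffle read is an assumption.)
    if duration_ratio > 5 and read_ratio > 5:
        return "data skew"

    # 2. Memory pressure: spill on most tasks (assumed: more than half)
    #    or GC time above 10% of task time.
    if m.tasks_with_spill * 2 > m.task_count or m.gc_time_fraction > 0.10:
        return "memory pressure"

    # 3. Underparallelism: task count below 2x the executor core count.
    if 2 * m.executor_cores > m.task_count:
        return "underparallelism"

    # None of the three fit: move on to the SQL/DataFrame tab.
    return "check the physical plan"


# Example: a few very slow tasks that also read far more shuffle data.
stage = StageMetrics(
    median_task_sec=10, max_task_sec=120,
    median_shuffle_read_mb=50, max_shuffle_read_mb=900,
    tasks_with_spill=2, task_count=200,
    gc_time_fraction=0.03, executor_cores=32,
)
print(triage(stage))  # data skew
```

&lt;P&gt;The checks run in the same order as the questions, so a stage that is both skewed and spilling is reported as skew first. That matches the idea of fixing the data distribution before touching cluster configuration.&lt;/P&gt;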
&lt;P&gt;Check the blog and let me know if you have any questions. Happy to dig into any specific stage metrics.&lt;/P&gt;
</description>
      <pubDate>Thu, 25 Jun 2026 19:18:09 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-ui-troubleshooting-data-skew-vs-cluster-resource/m-p/160580#M54907</guid>
      <dc:creator>Ashwin_DSA</dc:creator>
      <dc:date>2026-06-25T19:18:09Z</dc:date>
    </item>
  </channel>
</rss>

