<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Reading Spark UI: A Repeatable Guide to Finding Performance Bottlenecks in Community Articles</title>
    <link>https://community.databricks.com/t5/community-articles/reading-spark-ui-a-repeatable-guide-to-finding-performance/m-p/160574#M1320</link>
    <description>&lt;DIV style="max-width: 860px; margin: 0 auto; padding: 0 32px 80px; background: #FFFFFF; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Helvetica, Arial, sans-serif; color: #1b3139; line-height: 1.75; font-size: 17px;"&gt;
&lt;P&gt;A &lt;A href="https://community.databricks.com/t5/data-engineering/spark-ui-troubleshooting-data-skew-vs-cluster-resource/td-p/160517" target="_self"&gt;question&lt;/A&gt; came up in the community recently that I thought deserved more than a short answer. The question was around how to build a reliable investigation sequence for slow Spark jobs, specifically when symptoms overlap. A long-running stage with high spill and a few slow tasks could be data skew, insufficient executor memory, too few partitions, or an inefficient join strategy. The Spark UI has all the information you need to tell them apart, but only if you know where to look and in what order.&lt;/P&gt;
&lt;P&gt;I put together a lab notebook with three intentionally broken jobs, one per bottleneck type, ran them on a Databricks cluster with Photon disabled to expose the classic Spark signatures, and captured every screenshot from a real run. This post walks through what each bottleneck looks like in the UI, the fix, and how to confirm the fix actually worked. The goal is a sequence you can apply directly the next time a stage takes longer than it should.&lt;/P&gt;
&lt;DIV style="background: #F0F4F6; border-left: 4px solid #FF3621; padding: 20px 24px; margin: 28px 0; border-radius: 6px;"&gt;
&lt;DIV style="font-size: 12px; font-weight: bold; color: #ff3621; text-transform: uppercase; letter-spacing: 1.5px; margin-bottom: 12px;"&gt;Key Takeaways&lt;/DIV&gt;
&lt;UL style="margin: 0; padding-left: 20px; color: #1b3139;"&gt;
&lt;LI style="margin-bottom: 8px;"&gt;Start in the Stages tab. Find the slowest stage, then ask three questions in order before touching any configuration.&lt;/LI&gt;
&lt;LI style="margin-bottom: 8px;"&gt;Data skew, memory pressure, and underparallelism each produce a distinct Spark UI signature. The Max/Median duration ratio and the spill distribution are the fastest discriminators.&lt;/LI&gt;
&lt;LI style="margin-bottom: 8px;"&gt;A single faster run is not validation. The underlying metric (GC time, spill, task ratio) must move, not just wall-clock time.&lt;/LI&gt;
&lt;LI style="margin-bottom: 0;"&gt;Enable AQE before doing any manual tuning. It resolves a large fraction of shuffle partition and broadcast join problems automatically.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/DIV&gt;
&lt;DIV style="background: #FAFBFC; border: 1px solid #E8ECF0; border-radius: 8px; padding: 20px 28px; margin: 24px 0 16px;"&gt;
&lt;DIV style="font-size: 13px; font-weight: bold; color: #ff3621; text-transform: uppercase; letter-spacing: 1.5px; margin-bottom: 12px;"&gt;What's in this post&lt;/DIV&gt;
&lt;OL style="margin: 0; padding-left: 20px;"&gt;
&lt;LI style="margin-bottom: 6px; font-size: 15px;"&gt;&lt;A style="color: #1b3139; text-decoration: none;" href="#sequence" target="_blank"&gt;The investigation sequence&lt;/A&gt;&lt;/LI&gt;
&lt;LI style="margin-bottom: 6px; font-size: 15px;"&gt;&lt;A style="color: #1b3139; text-decoration: none;" href="#skew" target="_blank"&gt;Scenario 1: Data skew&lt;/A&gt;&lt;/LI&gt;
&lt;LI style="margin-bottom: 6px; font-size: 15px;"&gt;&lt;A style="color: #1b3139; text-decoration: none;" href="#skew-fix" target="_blank"&gt;Scenario 1b: Broadcast join fix&lt;/A&gt;&lt;/LI&gt;
&lt;LI style="margin-bottom: 6px; font-size: 15px;"&gt;&lt;A style="color: #1b3139; text-decoration: none;" href="#memory" target="_blank"&gt;Scenario 2: Memory pressure&lt;/A&gt;&lt;/LI&gt;
&lt;LI style="margin-bottom: 6px; font-size: 15px;"&gt;&lt;A style="color: #1b3139; text-decoration: none;" href="#memory-fix" target="_blank"&gt;Scenario 2b: Partition fix&lt;/A&gt;&lt;/LI&gt;
&lt;LI style="margin-bottom: 6px; font-size: 15px;"&gt;&lt;A style="color: #1b3139; text-decoration: none;" href="#parallelism" target="_blank"&gt;Scenario 3: Underparallelism&lt;/A&gt;&lt;/LI&gt;
&lt;LI style="margin-bottom: 6px; font-size: 15px;"&gt;&lt;A style="color: #1b3139; text-decoration: none;" href="#decision" target="_blank"&gt;Decision map and thresholds&lt;/A&gt;&lt;/LI&gt;
&lt;LI style="margin-bottom: 0; font-size: 15px;"&gt;&lt;A style="color: #1b3139; text-decoration: none;" href="#validation" target="_blank"&gt;Validating the fix&lt;/A&gt;&lt;/LI&gt;
&lt;/OL&gt;
&lt;/DIV&gt;
&lt;H2 id="sequence" style="font-size: 26px; font-weight: bold; color: #1b3139; margin: 32px 0 16px; padding-bottom: 8px; border-bottom: 3px solid #FF3621; display: inline-block;"&gt;The investigation sequence&lt;/H2&gt;
&lt;P&gt;Before clicking anything, open the &lt;STRONG&gt;Stages&lt;/STRONG&gt; tab and sort by Duration descending. Pick the longest stage. Everything else is noise until you understand that one stage.&lt;/P&gt;
&lt;P&gt;Inside the stage, you need three numbers before drawing any conclusions:&lt;/P&gt;
&lt;UL style="padding-left: 24px; margin: 0 0 16px;"&gt;
&lt;LI style="margin-bottom: 8px;"&gt;&lt;STRONG&gt;Median task duration&lt;/STRONG&gt;&lt;/LI&gt;
&lt;LI style="margin-bottom: 8px;"&gt;&lt;STRONG&gt;Max task duration&lt;/STRONG&gt;&lt;/LI&gt;
&lt;LI style="margin-bottom: 8px;"&gt;&lt;STRONG&gt;Median vs Max shuffle read size per task&lt;/STRONG&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;The ratio of Max to Median is your first discriminator. From there, three questions applied in order will identify the primary bottleneck in the large majority of cases.&lt;/P&gt;
&lt;DIV style="background: #F0F4F6; border-left: 4px solid #1B3139; padding: 16px 20px; margin: 20px 0; border-radius: 4px; font-size: 15px;"&gt;&lt;STRONG&gt;Question 1:&lt;/STRONG&gt; Is Max Duration more than 5x Median? If yes, check whether shuffle read bytes are also skewed. Both conditions together indicate data skew.&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;Question 2:&lt;/STRONG&gt; Does spill appear on most tasks (not just outliers)? Is GC time above 10% in the Executors tab? If yes, it is memory pressure.&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;Question 3:&lt;/STRONG&gt; Is task count well below 2x your executor core count? If yes, it is underparallelism.&lt;/DIV&gt;
&lt;P&gt;If none of those fit, open the &lt;STRONG&gt;SQL / DataFrame&lt;/STRONG&gt; tab and look at the physical plan. Missing predicate pushdown, unexpected cross joins, or a sort-merge join where broadcast would work are the next places to look.&lt;/P&gt;
&lt;HR /&gt;
&lt;H2 id="skew" style="font-size: 26px; font-weight: bold; color: #1b3139; margin: 28px 0 16px; padding-bottom: 8px; border-bottom: 3px solid #FF3621; display: inline-block;"&gt;Scenario 1: Data skew&lt;/H2&gt;
&lt;P&gt;The lab job joins a 20 million-row fact table where 70% of rows share the same join key (key=1) against a 20-row reference table. Broadcast is explicitly disabled to force a sort-merge join, which means all 20 million rows must be shuffled and sorted by join key before the merge. The partition holding key=1 receives 14 million rows. Every other partition receives a handful.&lt;/P&gt;
&lt;PRE style="background: #1B3139; border-left: 4px solid #FF3621; border-radius: 6px; padding: 20px 24px; margin: 20px 0; font-family: 'SF Mono', Monaco, Consolas, 'Courier New', monospace; font-size: 13.5px; line-height: 1.6; color: #e8ecf0; overflow-x: auto;"&gt;&lt;SPAN&gt;from&lt;/SPAN&gt; pyspark.sql &lt;SPAN&gt;import&lt;/SPAN&gt; functions &lt;SPAN&gt;as&lt;/SPAN&gt; F

spark.conf.set(&lt;SPAN&gt;"spark.sql.adaptive.enabled"&lt;/SPAN&gt;, &lt;SPAN&gt;"false"&lt;/SPAN&gt;)
spark.conf.set(&lt;SPAN&gt;"spark.sql.shuffle.partitions"&lt;/SPAN&gt;, &lt;SPAN&gt;"200"&lt;/SPAN&gt;)
spark.conf.set(&lt;SPAN&gt;"spark.sql.autoBroadcastJoinThreshold"&lt;/SPAN&gt;, &lt;SPAN&gt;"-1"&lt;/SPAN&gt;)  &lt;SPAN&gt;# force sort-merge join&lt;/SPAN&gt;

skew_data = (
    spark.range(&lt;SPAN&gt;20_000_000&lt;/SPAN&gt;)
    .withColumn(&lt;SPAN&gt;"join_key"&lt;/SPAN&gt;,
        F.when(F.rand() &amp;lt; &lt;SPAN&gt;0.70&lt;/SPAN&gt;, F.lit(&lt;SPAN&gt;1&lt;/SPAN&gt;))
         .when(F.rand() &amp;lt; &lt;SPAN&gt;0.85&lt;/SPAN&gt;, F.lit(&lt;SPAN&gt;2&lt;/SPAN&gt;))
         .otherwise((F.rand() * &lt;SPAN&gt;18&lt;/SPAN&gt; + &lt;SPAN&gt;3&lt;/SPAN&gt;).cast(&lt;SPAN&gt;"int"&lt;/SPAN&gt;)))
    .withColumn(&lt;SPAN&gt;"value"&lt;/SPAN&gt;, F.rand() * &lt;SPAN&gt;1000&lt;/SPAN&gt;)
    .withColumn(&lt;SPAN&gt;"payload"&lt;/SPAN&gt;, F.expr(&lt;SPAN&gt;"repeat(cast(rand() as string), 50)"&lt;/SPAN&gt;))
)

ref_data = (
    spark.range(&lt;SPAN&gt;1&lt;/SPAN&gt;, &lt;SPAN&gt;21&lt;/SPAN&gt;)
    .withColumnRenamed(&lt;SPAN&gt;"id"&lt;/SPAN&gt;, &lt;SPAN&gt;"join_key"&lt;/SPAN&gt;)
    .withColumn(&lt;SPAN&gt;"label"&lt;/SPAN&gt;, F.concat(F.lit(&lt;SPAN&gt;"cat_"&lt;/SPAN&gt;), F.col(&lt;SPAN&gt;"join_key"&lt;/SPAN&gt;).cast(&lt;SPAN&gt;"string"&lt;/SPAN&gt;)))
)

result = (
    skew_data.join(ref_data, &lt;SPAN&gt;"join_key"&lt;/SPAN&gt;, &lt;SPAN&gt;"inner"&lt;/SPAN&gt;)
    .groupBy(&lt;SPAN&gt;"join_key"&lt;/SPAN&gt;, &lt;SPAN&gt;"label"&lt;/SPAN&gt;)
    .agg(F.sum(&lt;SPAN&gt;"value"&lt;/SPAN&gt;).alias(&lt;SPAN&gt;"total"&lt;/SPAN&gt;), F.count(&lt;SPAN&gt;"*"&lt;/SPAN&gt;).alias(&lt;SPAN&gt;"cnt"&lt;/SPAN&gt;))
)
result.write.format(&lt;SPAN&gt;"noop"&lt;/SPAN&gt;).mode(&lt;SPAN&gt;"overwrite"&lt;/SPAN&gt;).save()&lt;/PRE&gt;
&lt;H3 style="font-size: 19px; font-weight: bold; color: #1b3139; margin: 24px 0 12px;"&gt;Task Metrics: the 24x ratio&lt;/H3&gt;
&lt;FIGURE style="margin: 16px 0;"&gt;
&lt;FIGCAPTION style="font-size: 13px; color: #6b8a97; margin-top: 8px; font-style: italic;"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Ashwin_DSA_0-1782413970227.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28285i68596E863E4E1DB4/image-size/large?v=v2&amp;amp;px=999" role="button" title="Ashwin_DSA_0-1782413970227.png" alt="Ashwin_DSA_0-1782413970227.png" /&gt;&lt;/span&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
Stage 2 Task Metrics. Median duration 38ms, Max 0.9s: a 24x ratio. Shuffle Read is 0 bytes on 199 of 200 partitions; one partition holds the hot key.&lt;/FIGCAPTION&gt;
&lt;/FIGURE&gt;
&lt;P&gt;The Task Metrics summary table is the most important screen in the Spark UI for diagnosing skew. Two rows matter here:&lt;/P&gt;
&lt;UL style="padding-left: 24px; margin: 0 0 16px;"&gt;
&lt;LI style="margin-bottom: 8px;"&gt;&lt;STRONG&gt;Duration&lt;/STRONG&gt;: Median 38ms, Max 0.9s. A 24x ratio. The 75th percentile sits at 53ms, meaning the outlier is not just slightly above average. It is categorically different from the rest of the distribution.&lt;/LI&gt;
&lt;LI style="margin-bottom: 8px;"&gt;&lt;STRONG&gt;Shuffle Read Size&lt;/STRONG&gt;: Median 0 bytes / 0 records across 199 of 200 tasks. The Max partition receives all the data. Most tasks have nothing to process.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;This is the defining skew signature: a bimodal distribution where the vast majority of tasks finish in milliseconds and one task runs for orders of magnitude longer.&lt;/P&gt;
&lt;H3 style="font-size: 19px; font-weight: bold; color: #1b3139; margin: 24px 0 12px;"&gt;Event Timeline: the visual tell&lt;/H3&gt;
&lt;FIGURE style="margin: 16px 0;"&gt;
&lt;FIGCAPTION style="font-size: 13px; color: #6b8a97; margin-top: 8px; font-style: italic;"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Ashwin_DSA_1-1782413989929.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28286i0C9CAD53CF88B402/image-size/large?v=v2&amp;amp;px=999" role="button" title="Ashwin_DSA_1-1782413989929.png" alt="Ashwin_DSA_1-1782413989929.png" /&gt;&lt;/span&gt;&lt;BR /&gt;Event Timeline for Stage 2. A handful of long green (Executor Computing Time) bars at the top, followed by 190+ short bars. This bimodal shape is the skew signature.&lt;/FIGCAPTION&gt;
&lt;/FIGURE&gt;
&lt;P&gt;The Event Timeline converts the numeric ratio into something immediately visual. Long green bars indicate CPU time spent processing the hot key partition. The colour here matters: green is Executor Computing Time, confirming the slow tasks are CPU-bound on data processing, not waiting on I/O or network. Memory pressure and underparallelism do not produce this bimodal shape, which makes it the fastest visual discriminator between the three bottleneck types.&lt;/P&gt;
&lt;H3 style="font-size: 19px; font-weight: bold; color: #1b3139; margin: 24px 0 12px;"&gt;DAG Visualization: confirming the join strategy&lt;/H3&gt;
&lt;FIGURE style="margin: 16px 0;"&gt;
&lt;FIGCAPTION style="font-size: 13px; color: #6b8a97; margin-top: 8px; font-style: italic;"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Ashwin_DSA_2-1782414031747.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28287iBAA1D228A920BF64/image-size/large?v=v2&amp;amp;px=999" role="button" title="Ashwin_DSA_2-1782414031747.png" alt="Ashwin_DSA_2-1782414031747.png" /&gt;&lt;/span&gt;DAG for Stage 2. Two Exchange (shuffle) nodes feed into two Sort operations, which merge via SortMergeJoin inside WholeStageCodegen. Both sides were fully shuffled and sorted by join key, making skew on the join key directly visible as a task outlier.&lt;/FIGCAPTION&gt;
&lt;/FIGURE&gt;
&lt;P&gt;The DAG confirms the mechanism. Two Exchange nodes (one per join side) followed by Sort operations and a SortMergeJoin. This is the join strategy that makes skew dangerous: every row for key=1 lands on the same partition, and one task must process all of it alone.&lt;/P&gt;
&lt;HR /&gt;
&lt;H2 id="skew-fix" style="font-size: 26px; font-weight: bold; color: #1b3139; margin: 28px 0 16px; padding-bottom: 8px; border-bottom: 3px solid #FF3621; display: inline-block;"&gt;Scenario 1b: Broadcast join fix&lt;/H2&gt;
&lt;P&gt;The reference table has 20 rows. It fits comfortably in executor memory. Enabling broadcast replicates it to every executor, eliminating the need to shuffle the join key entirely. The SortMergeJoin and its two Exchange stages disappear from the plan.&lt;/P&gt;
&lt;PRE style="background: #1B3139; border-left: 4px solid #FF3621; border-radius: 6px; padding: 20px 24px; margin: 20px 0; font-family: 'SF Mono', Monaco, Consolas, 'Courier New', monospace; font-size: 13.5px; line-height: 1.6; color: #e8ecf0; overflow-x: auto;"&gt;&lt;SPAN&gt;# Re-enable broadcast: small table replicated to every executor, no shuffle on join key&lt;/SPAN&gt;
spark.conf.set(&lt;SPAN&gt;"spark.sql.autoBroadcastJoinThreshold"&lt;/SPAN&gt;, str(&lt;SPAN&gt;10&lt;/SPAN&gt; * &lt;SPAN&gt;1024&lt;/SPAN&gt; * &lt;SPAN&gt;1024&lt;/SPAN&gt;))

result = (
    skew_data.join(F.broadcast(ref_data), &lt;SPAN&gt;"join_key"&lt;/SPAN&gt;, &lt;SPAN&gt;"inner"&lt;/SPAN&gt;)
    .groupBy(&lt;SPAN&gt;"join_key"&lt;/SPAN&gt;, &lt;SPAN&gt;"label"&lt;/SPAN&gt;)
    .agg(F.sum(&lt;SPAN&gt;"value"&lt;/SPAN&gt;).alias(&lt;SPAN&gt;"total"&lt;/SPAN&gt;), F.count(&lt;SPAN&gt;"*"&lt;/SPAN&gt;).alias(&lt;SPAN&gt;"cnt"&lt;/SPAN&gt;))
)
result.write.format(&lt;SPAN&gt;"noop"&lt;/SPAN&gt;).mode(&lt;SPAN&gt;"overwrite"&lt;/SPAN&gt;).save()&lt;/PRE&gt;
&lt;H3 style="font-size: 19px; font-weight: bold; color: #1b3139; margin: 24px 0 12px;"&gt;Jobs comparison: the numbers&lt;/H3&gt;
&lt;FIGURE style="margin: 16px 0;"&gt;
&lt;FIGCAPTION style="font-size: 13px; color: #6b8a97; margin-top: 8px; font-style: italic;"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Ashwin_DSA_3-1782414084869.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28288i8DCA710A0D32A3BD/image-size/large?v=v2&amp;amp;px=999" role="button" title="Ashwin_DSA_3-1782414084869.png" alt="Ashwin_DSA_3-1782414084869.png" /&gt;&lt;/span&gt;Jobs tab. Scenario 1 (no broadcast): 18s, 4 stages, 408 tasks. Scenario 1b (broadcast): 5s, 2 stages, 204 tasks. Half the stages, half the tasks, 3.6x faster.&lt;/FIGCAPTION&gt;
&lt;/FIGURE&gt;
&lt;P&gt;The improvement is not incremental. The fix collapses two of the four stages entirely. In production, where a skewed sort-merge join might anchor a 30-minute stage, this is the difference between a job completing before business hours and one that misses its SLA.&lt;/P&gt;
&lt;H3 style="font-size: 19px; font-weight: bold; color: #1b3139; margin: 24px 0 12px;"&gt;Stages after the fix&lt;/H3&gt;
&lt;FIGURE style="margin: 16px 0;"&gt;
&lt;FIGCAPTION style="font-size: 13px; color: #6b8a97; margin-top: 8px; font-style: italic;"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Ashwin_DSA_4-1782414108177.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28289i85F02D8D01FC5C8A/image-size/large?v=v2&amp;amp;px=999" role="button" title="Ashwin_DSA_4-1782414108177.png" alt="Ashwin_DSA_4-1782414108177.png" /&gt;&lt;/span&gt;Scenario 1b Stages. Two stages: a 4-task scan and a 200-task aggregation shuffle. The 200-task shuffle join stage that produced the 24x task ratio is absent.&lt;/FIGCAPTION&gt;
&lt;/FIGURE&gt;
&lt;P&gt;The fix did not make the slow task faster. It eliminated the stage that contained the slow task. That is a fundamentally different kind of improvement and it is the one to aim for with skew. If the data allows it, removing the shuffle is better than optimizing within it.&lt;/P&gt;
&lt;DIV style="background: #F0F4F6; border-left: 4px solid #FF3621; padding: 16px 20px; margin: 20px 0; border-radius: 4px; font-size: 15px;"&gt;&lt;STRONG&gt;When broadcast is not an option&lt;/STRONG&gt;: if the smaller side of the join is too large to broadcast (typically above 500MB to 1GB depending on executor memory), the next tool is salting. Add a random suffix to the hot key before the join to distribute its rows across multiple partitions, then strip the suffix in a second pass. AQE's skew join optimization (&lt;CODE style="font-family: 'SF Mono', Monaco, Consolas, 'Courier New', monospace; background: #E8ECF0; padding: 2px 5px; border-radius: 3px; font-size: 13.5px;"&gt;spark.sql.adaptive.skewJoin.enabled&lt;/CODE&gt;) automates a version of this when the skewed partition exceeds the configured threshold.&lt;/DIV&gt;
&lt;HR /&gt;
&lt;H2 id="memory" style="font-size: 26px; font-weight: bold; color: #1b3139; margin: 28px 0 16px; padding-bottom: 8px; border-bottom: 3px solid #FF3621; display: inline-block;"&gt;Scenario 2: Memory pressure&lt;/H2&gt;
&lt;P&gt;The lab job processes 4 million wide rows, each approximately 400 bytes of string data across five columns, shuffled into only 20 partitions. Each task receives roughly 8 MB of data. A &lt;CODE style="font-family: 'SF Mono', Monaco, Consolas, 'Courier New', monospace; background: #F0F4F6; padding: 2px 6px; border-radius: 4px; font-size: 14.5px; color: #c03020;"&gt;collect_list&lt;/CODE&gt; aggregation forces each task to hold the full list of strings in heap before writing the result. The combination of large per-task data and an in-memory accumulator produces GC pressure across all tasks.&lt;/P&gt;
&lt;PRE style="background: #1B3139; border-left: 4px solid #FF3621; border-radius: 6px; padding: 20px 24px; margin: 20px 0; font-family: 'SF Mono', Monaco, Consolas, 'Courier New', monospace; font-size: 13.5px; line-height: 1.6; color: #e8ecf0; overflow-x: auto;"&gt;spark.conf.set(&lt;SPAN&gt;"spark.sql.shuffle.partitions"&lt;/SPAN&gt;, &lt;SPAN&gt;"20"&lt;/SPAN&gt;)  &lt;SPAN&gt;# intentionally too few&lt;/SPAN&gt;

wide_data = (
    spark.range(&lt;SPAN&gt;4_000_000&lt;/SPAN&gt;)
    .withColumn(&lt;SPAN&gt;"group_key"&lt;/SPAN&gt;, (F.col(&lt;SPAN&gt;"id"&lt;/SPAN&gt;) % &lt;SPAN&gt;20&lt;/SPAN&gt;).cast(&lt;SPAN&gt;"int"&lt;/SPAN&gt;))
    .withColumn(&lt;SPAN&gt;"col_a"&lt;/SPAN&gt;, F.expr(&lt;SPAN&gt;"repeat(cast(rand() as string), 80)"&lt;/SPAN&gt;))
    .withColumn(&lt;SPAN&gt;"col_b"&lt;/SPAN&gt;, F.expr(&lt;SPAN&gt;"repeat(cast(rand() as string), 80)"&lt;/SPAN&gt;))
    .withColumn(&lt;SPAN&gt;"col_c"&lt;/SPAN&gt;, F.expr(&lt;SPAN&gt;"repeat(cast(rand() as string), 80)"&lt;/SPAN&gt;))
    .withColumn(&lt;SPAN&gt;"col_d"&lt;/SPAN&gt;, F.expr(&lt;SPAN&gt;"repeat(cast(rand() as string), 80)"&lt;/SPAN&gt;))
    .withColumn(&lt;SPAN&gt;"col_e"&lt;/SPAN&gt;, F.expr(&lt;SPAN&gt;"repeat(cast(rand() as string), 80)"&lt;/SPAN&gt;))
    .withColumn(&lt;SPAN&gt;"metric"&lt;/SPAN&gt;, F.rand() * &lt;SPAN&gt;1000&lt;/SPAN&gt;)
)

result = (
    wide_data
    .repartition(&lt;SPAN&gt;20&lt;/SPAN&gt;, &lt;SPAN&gt;"group_key"&lt;/SPAN&gt;)
    .groupBy(&lt;SPAN&gt;"group_key"&lt;/SPAN&gt;)
    .agg(
        F.sum(&lt;SPAN&gt;"metric"&lt;/SPAN&gt;).alias(&lt;SPAN&gt;"total"&lt;/SPAN&gt;),
        F.collect_list(&lt;SPAN&gt;"col_a"&lt;/SPAN&gt;).alias(&lt;SPAN&gt;"all_a"&lt;/SPAN&gt;),  &lt;SPAN&gt;# forces large in-memory buffer&lt;/SPAN&gt;
        F.count(&lt;SPAN&gt;"*"&lt;/SPAN&gt;).alias(&lt;SPAN&gt;"cnt"&lt;/SPAN&gt;)
    )
)
result.write.format(&lt;SPAN&gt;"noop"&lt;/SPAN&gt;).mode(&lt;SPAN&gt;"overwrite"&lt;/SPAN&gt;).save()&lt;/PRE&gt;
&lt;H3 style="font-size: 19px; font-weight: bold; color: #1b3139; margin: 24px 0 12px;"&gt;Stages: large shuffle volume, few tasks&lt;/H3&gt;
&lt;FIGURE style="margin: 16px 0;"&gt;
&lt;FIGCAPTION style="font-size: 13px; color: #6b8a97; margin-top: 8px; font-style: italic;"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Ashwin_DSA_5-1782414148744.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28290iF88D60C7753820FB/image-size/large?v=v2&amp;amp;px=999" role="button" title="Ashwin_DSA_5-1782414148744.png" alt="Ashwin_DSA_5-1782414148744.png" /&gt;&lt;/span&gt;Scenario 2 Stages. Stage 8: 20 tasks, 173.5 MiB shuffle read, 12 seconds. Compare to Scenario 1's Stage 2: 200 tasks, 7 KiB shuffle read, 9 seconds. The absolute shuffle volume flags the problem before you even click in.&lt;/FIGCAPTION&gt;
&lt;/FIGURE&gt;
&lt;P&gt;The 173.5 MiB total shuffle read across 20 tasks means each task processes approximately 8.7 MiB of serialized data, before accounting for the deserialized in-memory representation which is larger. This is where the memory constraint originates.&lt;/P&gt;
&lt;H3 style="font-size: 19px; font-weight: bold; color: #1b3139; margin: 24px 0 12px;"&gt;Task Metrics: GC time is the signal&lt;/H3&gt;
&lt;FIGURE style="margin: 16px 0;"&gt;
&lt;FIGCAPTION style="font-size: 13px; color: #6b8a97; margin-top: 8px; font-style: italic;"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Ashwin_DSA_6-1782414171745.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28291i70EBCBFA8FEE4F77/image-size/large?v=v2&amp;amp;px=999" role="button" title="Ashwin_DSA_6-1782414171745.png" alt="Ashwin_DSA_6-1782414171745.png" /&gt;&lt;/span&gt;Stage 8 Task Metrics. Duration: Median 2s, Max 9s (4.5x ratio, much tighter than skew's 24x). GC Time: Max 4s on a 9s task, 44% of task time in garbage collection. Shuffle Read Median is 8.3 MiB / 200k records per task, confirming large uniform per-task data volumes.&lt;/FIGCAPTION&gt;
&lt;/FIGURE&gt;
&lt;P&gt;Two things separate this from skew:&lt;/P&gt;
&lt;UL style="padding-left: 24px; margin: 0 0 16px;"&gt;
&lt;LI style="margin-bottom: 8px;"&gt;&lt;STRONG&gt;Duration ratio&lt;/STRONG&gt;: Max/Median is 4.5x, compared to 24x in Scenario 1. All the heavy tasks are slow together, not one outlier dragging the stage.&lt;/LI&gt;
&lt;LI style="margin-bottom: 8px;"&gt;&lt;STRONG&gt;GC Time&lt;/STRONG&gt;: Max 4 seconds on a 9-second task. That is 44% of task time in garbage collection. The JVM is constantly reclaiming the large string arrays built by &lt;CODE style="font-family: 'SF Mono', Monaco, Consolas, 'Courier New', monospace; background: #F0F4F6; padding: 2px 6px; border-radius: 4px; font-size: 14.5px; color: #c03020;"&gt;collect_list&lt;/CODE&gt;. This is the clearest memory pressure indicator available in the Spark UI.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Note there is no Spill row in this screenshot. On this cluster with 10.7 GiB executor memory, the data stays in heap but causes severe GC pressure. In production with smaller executor memory relative to partition size, the same root cause produces disk spill instead. The fix is identical in both cases: reduce per-task data volume.&lt;/P&gt;
&lt;DIV style="background: #F0F4F6; border-left: 4px solid #1B3139; padding: 16px 20px; margin: 20px 0; border-radius: 4px; font-size: 15px;"&gt;&lt;STRONG&gt;Executors tab note&lt;/STRONG&gt;: on a multi-executor cluster, check the Executors tab. The GC Time column shows cumulative GC per executor. Uniformly high GC across all executors points to a partition sizing problem. GC concentrated on specific executors points to uneven data distribution.&lt;/DIV&gt;
&lt;HR /&gt;
&lt;H2 id="memory-fix" style="font-size: 26px; font-weight: bold; color: #1b3139; margin: 28px 0 16px; padding-bottom: 8px; border-bottom: 3px solid #FF3621; display: inline-block;"&gt;Scenario 2b: Partition fix&lt;/H2&gt;
&lt;P&gt;The fix raises shuffle partitions from 20 to 400 and removes the &lt;CODE style="font-family: 'SF Mono', Monaco, Consolas, 'Courier New', monospace; background: #F0F4F6; padding: 2px 6px; border-radius: 4px; font-size: 14.5px; color: #c03020;"&gt;collect_list&lt;/CODE&gt; aggregation. Each task now receives a much smaller data chunk, and no large in-memory buffers are built.&lt;/P&gt;
&lt;PRE style="background: #1B3139; border-left: 4px solid #FF3621; border-radius: 6px; padding: 20px 24px; margin: 20px 0; font-family: 'SF Mono', Monaco, Consolas, 'Courier New', monospace; font-size: 13.5px; line-height: 1.6; color: #e8ecf0; overflow-x: auto;"&gt;spark.conf.set(&lt;SPAN&gt;"spark.sql.shuffle.partitions"&lt;/SPAN&gt;, &lt;SPAN&gt;"400"&lt;/SPAN&gt;)

result = (
    wide_data
    .repartition(&lt;SPAN&gt;400&lt;/SPAN&gt;, &lt;SPAN&gt;"group_key"&lt;/SPAN&gt;)
    .groupBy(&lt;SPAN&gt;"group_key"&lt;/SPAN&gt;)
    .agg(F.sum(&lt;SPAN&gt;"metric"&lt;/SPAN&gt;).alias(&lt;SPAN&gt;"total"&lt;/SPAN&gt;), F.count(&lt;SPAN&gt;"*"&lt;/SPAN&gt;).alias(&lt;SPAN&gt;"cnt"&lt;/SPAN&gt;))
    &lt;SPAN&gt;# collect_list removed: no large in-memory accumulator&lt;/SPAN&gt;
)
result.write.format(&lt;SPAN&gt;"noop"&lt;/SPAN&gt;).mode(&lt;SPAN&gt;"overwrite"&lt;/SPAN&gt;).save()&lt;/PRE&gt;
&lt;FIGURE style="margin: 16px 0;"&gt;
&lt;FIGCAPTION style="font-size: 13px; color: #6b8a97; margin-top: 8px; font-style: italic;"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Ashwin_DSA_11-1782414614629.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28296i7F3A49006216025E/image-size/large?v=v2&amp;amp;px=999" role="button" title="Ashwin_DSA_11-1782414614629.png" alt="Ashwin_DSA_11-1782414614629.png" /&gt;&lt;/span&gt;Scenario 2b Stages. Stage 10: 400 tasks, 39.0 MiB shuffle read (down from 173.5 MiB), 6 seconds (down from 12 seconds).&lt;/FIGCAPTION&gt;
&lt;/FIGURE&gt;
&lt;FIGURE style="margin: 16px 0;"&gt;
&lt;FIGCAPTION style="font-size: 13px; color: #6b8a97; margin-top: 8px; font-style: italic;"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Ashwin_DSA_12-1782414635602.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28297i7F369B63913346EC/image-size/large?v=v2&amp;amp;px=999" role="button" title="Ashwin_DSA_12-1782414635602.png" alt="Ashwin_DSA_12-1782414635602.png" /&gt;&lt;/span&gt;Scenario 2b Task Metrics. GC Time: 0ms across all percentiles including Max. Duration: Median 10ms, Max 0.2s. The root cause is eliminated, not just reduced.&lt;/FIGCAPTION&gt;
&lt;/FIGURE&gt;
&lt;TABLE style="width: 100%; border-collapse: collapse; margin: 20px 0; font-size: 15px;"&gt;
&lt;THEAD&gt;
&lt;TR style="border-bottom: 2px solid #1B3139;"&gt;
&lt;TH style="text-align: left; padding: 10px 12px; color: #1b3139;"&gt;Metric&lt;/TH&gt;
&lt;TH style="text-align: right; padding: 10px 12px; color: #ff3621;"&gt;20 partitions&lt;/TH&gt;
&lt;TH style="text-align: right; padding: 10px 12px; color: #1b3139;"&gt;400 partitions&lt;/TH&gt;
&lt;/TR&gt;
&lt;/THEAD&gt;
&lt;TBODY&gt;
&lt;TR style="border-bottom: 1px solid #E8ECF0;"&gt;
&lt;TD style="padding: 10px 12px;"&gt;Stage duration&lt;/TD&gt;
&lt;TD style="text-align: right; padding: 10px 12px; color: #ff3621;"&gt;12s&lt;/TD&gt;
&lt;TD style="text-align: right; padding: 10px 12px;"&gt;6s&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR style="border-bottom: 1px solid #E8ECF0; background: #FAFBFC;"&gt;
&lt;TD style="padding: 10px 12px;"&gt;Median task duration&lt;/TD&gt;
&lt;TD style="text-align: right; padding: 10px 12px; color: #ff3621;"&gt;2s&lt;/TD&gt;
&lt;TD style="text-align: right; padding: 10px 12px;"&gt;10ms&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR style="border-bottom: 1px solid #E8ECF0;"&gt;
&lt;TD style="padding: 10px 12px;"&gt;Max task duration&lt;/TD&gt;
&lt;TD style="text-align: right; padding: 10px 12px; color: #ff3621;"&gt;9s&lt;/TD&gt;
&lt;TD style="text-align: right; padding: 10px 12px;"&gt;0.2s&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR style="border-bottom: 1px solid #E8ECF0; background: #FAFBFC;"&gt;
&lt;TD style="padding: 10px 12px;"&gt;Max GC Time&lt;/TD&gt;
&lt;TD style="text-align: right; padding: 10px 12px; color: #ff3621;"&gt;4s (44% of task)&lt;/TD&gt;
&lt;TD style="text-align: right; padding: 10px 12px;"&gt;0ms&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD style="padding: 10px 12px;"&gt;Total shuffle read&lt;/TD&gt;
&lt;TD style="text-align: right; padding: 10px 12px; color: #ff3621;"&gt;173.5 MiB&lt;/TD&gt;
&lt;TD style="text-align: right; padding: 10px 12px;"&gt;39.0 MiB&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;P&gt;GC time dropping to zero is the validation signal. Not the wall-clock improvement (which is real), but the fact that the JVM has nothing to collect. The objects that were triggering pressure no longer exist at that size in heap.&lt;/P&gt;
&lt;HR /&gt;
&lt;H2 id="parallelism" style="font-size: 26px; font-weight: bold; color: #1b3139; margin: 28px 0 16px; padding-bottom: 8px; border-bottom: 3px solid #FF3621; display: inline-block;"&gt;Scenario 3: Underparallelism&lt;/H2&gt;
&lt;P&gt;The lab job runs a 3 million-row aggregation after repartitioning to 4 partitions. There is no skew and no spill. The cluster is simply not given enough tasks to use its available cores.&lt;/P&gt;
&lt;PRE style="background: #1B3139; border-left: 4px solid #FF3621; border-radius: 6px; padding: 20px 24px; margin: 20px 0; font-family: 'SF Mono', Monaco, Consolas, 'Courier New', monospace; font-size: 13.5px; line-height: 1.6; color: #e8ecf0; overflow-x: auto;"&gt;spark.conf.set(&lt;SPAN&gt;"spark.sql.shuffle.partitions"&lt;/SPAN&gt;, &lt;SPAN&gt;"4"&lt;/SPAN&gt;)  &lt;SPAN&gt;# intentionally too low&lt;/SPAN&gt;

under_data = (
    spark.range(&lt;SPAN&gt;3_000_000&lt;/SPAN&gt;)
    .withColumn(&lt;SPAN&gt;"category"&lt;/SPAN&gt;, (F.rand() * &lt;SPAN&gt;100&lt;/SPAN&gt;).cast(&lt;SPAN&gt;"int"&lt;/SPAN&gt;).cast(&lt;SPAN&gt;"string"&lt;/SPAN&gt;))
    .withColumn(&lt;SPAN&gt;"value"&lt;/SPAN&gt;, F.rand() * &lt;SPAN&gt;500&lt;/SPAN&gt;)
)

result = (
    under_data
    .repartition(&lt;SPAN&gt;4&lt;/SPAN&gt;)  &lt;SPAN&gt;# 4 tasks regardless of cluster size&lt;/SPAN&gt;
    .groupBy(&lt;SPAN&gt;"category"&lt;/SPAN&gt;)
    .agg(F.avg(&lt;SPAN&gt;"value"&lt;/SPAN&gt;).alias(&lt;SPAN&gt;"avg_val"&lt;/SPAN&gt;), F.count(&lt;SPAN&gt;"*"&lt;/SPAN&gt;).alias(&lt;SPAN&gt;"cnt"&lt;/SPAN&gt;))
    .orderBy(&lt;SPAN&gt;"cnt"&lt;/SPAN&gt;, ascending=&lt;SPAN&gt;False&lt;/SPAN&gt;)
)
result.write.format(&lt;SPAN&gt;"noop"&lt;/SPAN&gt;).mode(&lt;SPAN&gt;"overwrite"&lt;/SPAN&gt;).save()

&lt;SPAN&gt;# Fix: raise partitions to match data volume&lt;/SPAN&gt;
spark.conf.set(&lt;SPAN&gt;"spark.sql.shuffle.partitions"&lt;/SPAN&gt;, &lt;SPAN&gt;"100"&lt;/SPAN&gt;)
result = (
    under_data.repartition(&lt;SPAN&gt;100&lt;/SPAN&gt;)
    .groupBy(&lt;SPAN&gt;"category"&lt;/SPAN&gt;)
    .agg(F.avg(&lt;SPAN&gt;"value"&lt;/SPAN&gt;).alias(&lt;SPAN&gt;"avg_val"&lt;/SPAN&gt;), F.count(&lt;SPAN&gt;"*"&lt;/SPAN&gt;).alias(&lt;SPAN&gt;"cnt"&lt;/SPAN&gt;))
    .orderBy(&lt;SPAN&gt;"cnt"&lt;/SPAN&gt;, ascending=&lt;SPAN&gt;False&lt;/SPAN&gt;)
)
result.write.format(&lt;SPAN&gt;"noop"&lt;/SPAN&gt;).mode(&lt;SPAN&gt;"overwrite"&lt;/SPAN&gt;).save()&lt;/PRE&gt;
&lt;FIGURE style="margin: 16px 0;"&gt;
&lt;FIGCAPTION style="font-size: 13px; color: #6b8a97; margin-top: 8px; font-style: italic;"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Ashwin_DSA_13-1782414658971.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28298i0430183E617B73C0/image-size/large?v=v2&amp;amp;px=999" role="button" title="Ashwin_DSA_13-1782414658971.png" alt="Ashwin_DSA_13-1782414658971.png" /&gt;&lt;/span&gt;Scenario 3 Stages. Three stages, all with 4/4 tasks. On a production cluster with 32 or 64 cores, the same configuration leaves the vast majority of the cluster idle throughout the job.&lt;/FIGCAPTION&gt;
&lt;/FIGURE&gt;
&lt;P&gt;The underparallelism tell in the Stages tab is the task count. If every stage runs a small, fixed number of tasks regardless of data volume, the shuffle partition count is almost certainly the constraint. Check &lt;CODE style="font-family: 'SF Mono', Monaco, Consolas, 'Courier New', monospace; background: #F0F4F6; padding: 2px 6px; border-radius: 4px; font-size: 14.5px; color: #c03020;"&gt;spark.sql.shuffle.partitions&lt;/CODE&gt; and compare it to 2x your executor core count as a starting floor.&lt;/P&gt;
&lt;P&gt;Unlike skew and memory pressure, underparallelism produces no spill, no GC pressure, and a tight Max/Median ratio. All tasks run cleanly. The job simply uses a fraction of available parallelism, so it takes proportionally longer than it needs to.&lt;/P&gt;
&lt;FIGURE style="margin: 16px 0;"&gt;
&lt;FIGCAPTION style="font-size: 13px; color: #6b8a97; margin-top: 8px; font-style: italic;"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Ashwin_DSA_14-1782414675999.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28299iF1DD90C43FC681A0/image-size/large?v=v2&amp;amp;px=999" role="button" title="Ashwin_DSA_14-1782414675999.png" alt="Ashwin_DSA_14-1782414675999.png" /&gt;&lt;/span&gt;Scenario 3b Stages. 100 tasks in the aggregation stage, 94 in the sort stage. On a production cluster with 32 cores, the 4-task version wastes 28 cores per wave. The 100-task version keeps the cluster busy and completes in roughly 1/8 the wall-clock time.&lt;/FIGCAPTION&gt;
&lt;/FIGURE&gt;
&lt;HR /&gt;
&lt;H2 id="decision" style="font-size: 26px; font-weight: bold; color: #1b3139; margin: 28px 0 16px; padding-bottom: 8px; border-bottom: 3px solid #FF3621; display: inline-block;"&gt;Decision map and practical thresholds&lt;/H2&gt;
&lt;TABLE style="width: 100%; border-collapse: collapse; margin: 16px 0; font-size: 14px;"&gt;
&lt;THEAD&gt;
&lt;TR style="background: #1B3139; color: #ffffff;"&gt;
&lt;TH style="text-align: left; padding: 10px 12px;"&gt;Symptom&lt;/TH&gt;
&lt;TH style="text-align: left; padding: 10px 12px;"&gt;Root cause&lt;/TH&gt;
&lt;TH style="text-align: left; padding: 10px 12px;"&gt;First action&lt;/TH&gt;
&lt;/TR&gt;
&lt;/THEAD&gt;
&lt;TBODY&gt;
&lt;TR style="border-bottom: 1px solid #E8ECF0;"&gt;
&lt;TD style="padding: 10px 12px;"&gt;Max/Median &amp;gt; 5x, shuffle read skewed&lt;/TD&gt;
&lt;TD style="padding: 10px 12px;"&gt;Data skew&lt;/TD&gt;
&lt;TD style="padding: 10px 12px;"&gt;Broadcast join if small side fits; salting if not&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR style="border-bottom: 1px solid #E8ECF0; background: #FAFBFC;"&gt;
&lt;TD style="padding: 10px 12px;"&gt;Spill on most tasks or GC &amp;gt; 10%&lt;/TD&gt;
&lt;TD style="padding: 10px 12px;"&gt;Memory pressure&lt;/TD&gt;
&lt;TD style="padding: 10px 12px;"&gt;More partitions before adding executor memory&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR style="border-bottom: 1px solid #E8ECF0;"&gt;
&lt;TD style="padding: 10px 12px;"&gt;Task count &amp;lt; 2x executor cores&lt;/TD&gt;
&lt;TD style="padding: 10px 12px;"&gt;Underparallelism&lt;/TD&gt;
&lt;TD style="padding: 10px 12px;"&gt;Raise &lt;CODE style="font-family: 'SF Mono', Monaco, Consolas, 'Courier New', monospace; background: #F0F4F6; padding: 2px 5px; border-radius: 3px; font-size: 13px;"&gt;spark.sql.shuffle.partitions&lt;/CODE&gt; or add &lt;CODE style="font-family: 'SF Mono', Monaco, Consolas, 'Courier New', monospace; background: #F0F4F6; padding: 2px 5px; border-radius: 3px; font-size: 13px;"&gt;repartition()&lt;/CODE&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR style="background: #FAFBFC;"&gt;
&lt;TD style="padding: 10px 12px;"&gt;None of the above&lt;/TD&gt;
&lt;TD style="padding: 10px 12px;"&gt;Plan problem&lt;/TD&gt;
&lt;TD style="padding: 10px 12px;"&gt;SQL tab: check for cross joins, missing predicate pushdown, wrong join strategy&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;H3 style="font-size: 19px; font-weight: bold; color: #1b3139; margin: 24px 0 12px;"&gt;Thresholds experienced teams use&lt;/H3&gt;
&lt;UL style="padding-left: 24px; margin: 0 0 16px;"&gt;
&lt;LI style="margin-bottom: 8px;"&gt;&lt;STRONG&gt;Skew threshold&lt;/STRONG&gt;: any task reading more than 3x the median shuffle bytes warrants investigation. AQE's skew join optimization fires at 256MB by default. Lowering it to 64MB catches problems earlier: &lt;CODE style="font-family: 'SF Mono', Monaco, Consolas, 'Courier New', monospace; background: #F0F4F6; padding: 2px 6px; border-radius: 4px; font-size: 14.5px; color: #c03020;"&gt;spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes=67108864&lt;/CODE&gt;&lt;/LI&gt;
&lt;LI style="margin-bottom: 8px;"&gt;&lt;STRONG&gt;Spill threshold&lt;/STRONG&gt;: any disk spill is worth addressing. Even 100MB of spill indicates the executor memory-to-partition-size ratio is wrong.&lt;/LI&gt;
&lt;LI style="margin-bottom: 8px;"&gt;&lt;STRONG&gt;GC threshold&lt;/STRONG&gt;: GC time above 10% of task time in the Executors tab means memory is a primary constraint. Above 20%, it is the dominant cause of slowness.&lt;/LI&gt;
&lt;LI style="margin-bottom: 8px;"&gt;&lt;STRONG&gt;Broadcast threshold&lt;/STRONG&gt;: &lt;CODE style="font-family: 'SF Mono', Monaco, Consolas, 'Courier New', monospace; background: #F0F4F6; padding: 2px 6px; border-radius: 4px; font-size: 14.5px; color: #c03020;"&gt;spark.sql.autoBroadcastJoinThreshold&lt;/CODE&gt; defaults to 10MB. In practice, tables up to 500MB to 1GB broadcast safely on modern clusters if executor memory is adequate. Check the physical plan in the SQL tab to confirm which join strategy Spark chose.&lt;/LI&gt;
&lt;LI style="margin-bottom: 8px;"&gt;&lt;STRONG&gt;Partition sizing&lt;/STRONG&gt;: target 128 to 256MB of input data per partition for shuffle-heavy stages. For compute-intensive stages with complex UDFs, target 64MB to avoid GC pressure from large in-memory objects.&lt;/LI&gt;
&lt;/UL&gt;
&lt;DIV style="background: #F0F4F6; border-left: 4px solid #FF3621; padding: 16px 20px; margin: 20px 0; border-radius: 4px; font-size: 15px;"&gt;&lt;STRONG&gt;Enable AQE first&lt;/STRONG&gt;: set &lt;CODE style="font-family: 'SF Mono', Monaco, Consolas, 'Courier New', monospace; background: #E8ECF0; padding: 2px 5px; border-radius: 3px; font-size: 13.5px;"&gt;spark.sql.adaptive.enabled=true&lt;/CODE&gt; before doing any manual tuning. AQE resolves the majority of shuffle partition count and broadcast join problems automatically. Use the investigation sequence above for what AQE does not fix: severe skew on low-cardinality keys, memory pressure from large individual task sizes, and first-stage underparallelism before AQE has seen statistics.&lt;/DIV&gt;
&lt;HR /&gt;
&lt;H2 id="validation" style="font-size: 26px; font-weight: bold; color: #1b3139; margin: 28px 0 16px; padding-bottom: 8px; border-bottom: 3px solid #FF3621; display: inline-block;"&gt;Validating the fix&lt;/H2&gt;
&lt;P&gt;A single faster run is not enough. Here is what to check after applying a fix:&lt;/P&gt;
&lt;OL style="padding-left: 24px; margin: 0 0 16px;"&gt;
&lt;LI style="margin-bottom: 12px;"&gt;&lt;STRONG&gt;The metric moved, not just the clock.&lt;/STRONG&gt; If you salted for skew, confirm the Max/Median ratio dropped below 2x. If you increased partitions for memory pressure, confirm GC time dropped to zero or near zero, not just reduced. Wall-clock improvement with no change in the underlying metric means something else is limiting performance.&lt;/LI&gt;
&lt;LI style="margin-bottom: 12px;"&gt;&lt;STRONG&gt;Spill is zero, not reduced.&lt;/STRONG&gt; Spill dropping from 10GB to 2GB means you improved the memory-to-partition ratio but did not fully resolve it. The correct partition size produces zero spill.&lt;/LI&gt;
&lt;LI style="margin-bottom: 12px;"&gt;&lt;STRONG&gt;Run on a cold cache.&lt;/STRONG&gt; The second run of a job often benefits from cached shuffle files from the first run. Force a fresh run by clearing cache or changing the input path, otherwise the improvement may be partially artificial.&lt;/LI&gt;
&lt;LI style="margin-bottom: 12px;"&gt;&lt;STRONG&gt;Run at full production data volume.&lt;/STRONG&gt; Skew and memory pressure are data-volume-dependent. A fix that works on a 10% sample frequently fails at full volume because the skewed key's frequency is nonlinear and per-partition data volume changes.&lt;/LI&gt;
&lt;LI style="margin-bottom: 12px;"&gt;&lt;STRONG&gt;Check across a time window.&lt;/STRONG&gt; For recurring jobs, pull 7 days of run history after applying the fix and confirm duration variance dropped. A job that averages fast but spikes 3x on weekends still has a skew or partition imbalance problem the average obscures.&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;Hope this is useful for diagnosing your slow stages. If you have additional techniques or thresholds that work well in your environment, please share them in the comments. These patterns improve with more data points.&lt;/P&gt;
&lt;/DIV&gt;</description>
    <pubDate>Thu, 25 Jun 2026 19:11:51 GMT</pubDate>
    <dc:creator>Ashwin_DSA</dc:creator>
    <dc:date>2026-06-25T19:11:51Z</dc:date>
    <item>
      <title>Reading Spark UI: A Repeatable Guide to Finding Performance Bottlenecks</title>
      <link>https://community.databricks.com/t5/community-articles/reading-spark-ui-a-repeatable-guide-to-finding-performance/m-p/160574#M1320</link>
      <description>&lt;DIV style="max-width: 860px; margin: 0 auto; padding: 0 32px 80px; background: #FFFFFF; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Helvetica, Arial, sans-serif; color: #1b3139; line-height: 1.75; font-size: 17px;"&gt;
&lt;P&gt;A &lt;A href="https://community.databricks.com/t5/data-engineering/spark-ui-troubleshooting-data-skew-vs-cluster-resource/td-p/160517" target="_self"&gt;question&lt;/A&gt; came up in the community recently that I thought deserved more than a short answer. The question was around how to build a reliable investigation sequence for slow Spark jobs, specifically when symptoms overlap. A long-running stage with high spill and a few slow tasks could be data skew, insufficient executor memory, too few partitions, or an inefficient join strategy. The Spark UI has all the information you need to tell them apart, but only if you know where to look and in what order.&lt;/P&gt;
&lt;P&gt;I put together a lab notebook with three intentionally broken jobs, one per bottleneck type, ran them on a Databricks cluster with Photon disabled to expose the classic Spark signatures, and captured every screenshot from a real run. This post walks through what each bottleneck looks like in the UI, the fix, and how to confirm the fix actually worked. The goal is a sequence you can apply directly the next time a stage takes longer than it should.&lt;/P&gt;
&lt;DIV style="background: #F0F4F6; border-left: 4px solid #FF3621; padding: 20px 24px; margin: 28px 0; border-radius: 6px;"&gt;
&lt;DIV style="font-size: 12px; font-weight: bold; color: #ff3621; text-transform: uppercase; letter-spacing: 1.5px; margin-bottom: 12px;"&gt;Key Takeaways&lt;/DIV&gt;
&lt;UL style="margin: 0; padding-left: 20px; color: #1b3139;"&gt;
&lt;LI style="margin-bottom: 8px;"&gt;Start in the Stages tab. Find the slowest stage, then ask three questions in order before touching any configuration.&lt;/LI&gt;
&lt;LI style="margin-bottom: 8px;"&gt;Data skew, memory pressure, and underparallelism each produce a distinct Spark UI signature. The Max/Median duration ratio and the spill distribution are the fastest discriminators.&lt;/LI&gt;
&lt;LI style="margin-bottom: 8px;"&gt;A single faster run is not validation. The underlying metric (GC time, spill, task ratio) must move, not just wall-clock time.&lt;/LI&gt;
&lt;LI style="margin-bottom: 0;"&gt;Enable AQE before doing any manual tuning. It resolves a large fraction of shuffle partition and broadcast join problems automatically.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/DIV&gt;
&lt;DIV style="background: #FAFBFC; border: 1px solid #E8ECF0; border-radius: 8px; padding: 20px 28px; margin: 24px 0 16px;"&gt;
&lt;DIV style="font-size: 13px; font-weight: bold; color: #ff3621; text-transform: uppercase; letter-spacing: 1.5px; margin-bottom: 12px;"&gt;What's in this post&lt;/DIV&gt;
&lt;OL style="margin: 0; padding-left: 20px;"&gt;
&lt;LI style="margin-bottom: 6px; font-size: 15px;"&gt;&lt;A style="color: #1b3139; text-decoration: none;" href="#sequence" target="_blank"&gt;The investigation sequence&lt;/A&gt;&lt;/LI&gt;
&lt;LI style="margin-bottom: 6px; font-size: 15px;"&gt;&lt;A style="color: #1b3139; text-decoration: none;" href="#skew" target="_blank"&gt;Scenario 1: Data skew&lt;/A&gt;&lt;/LI&gt;
&lt;LI style="margin-bottom: 6px; font-size: 15px;"&gt;&lt;A style="color: #1b3139; text-decoration: none;" href="#skew-fix" target="_blank"&gt;Scenario 1b: Broadcast join fix&lt;/A&gt;&lt;/LI&gt;
&lt;LI style="margin-bottom: 6px; font-size: 15px;"&gt;&lt;A style="color: #1b3139; text-decoration: none;" href="#memory" target="_blank"&gt;Scenario 2: Memory pressure&lt;/A&gt;&lt;/LI&gt;
&lt;LI style="margin-bottom: 6px; font-size: 15px;"&gt;&lt;A style="color: #1b3139; text-decoration: none;" href="#memory-fix" target="_blank"&gt;Scenario 2b: Partition fix&lt;/A&gt;&lt;/LI&gt;
&lt;LI style="margin-bottom: 6px; font-size: 15px;"&gt;&lt;A style="color: #1b3139; text-decoration: none;" href="#parallelism" target="_blank"&gt;Scenario 3: Underparallelism&lt;/A&gt;&lt;/LI&gt;
&lt;LI style="margin-bottom: 6px; font-size: 15px;"&gt;&lt;A style="color: #1b3139; text-decoration: none;" href="#decision" target="_blank"&gt;Decision map and thresholds&lt;/A&gt;&lt;/LI&gt;
&lt;LI style="margin-bottom: 0; font-size: 15px;"&gt;&lt;A style="color: #1b3139; text-decoration: none;" href="#validation" target="_blank"&gt;Validating the fix&lt;/A&gt;&lt;/LI&gt;
&lt;/OL&gt;
&lt;/DIV&gt;
&lt;H2 id="sequence" style="font-size: 26px; font-weight: bold; color: #1b3139; margin: 32px 0 16px; padding-bottom: 8px; border-bottom: 3px solid #FF3621; display: inline-block;"&gt;The investigation sequence&lt;/H2&gt;
&lt;P&gt;Before clicking anything, open the &lt;STRONG&gt;Stages&lt;/STRONG&gt; tab and sort by Duration descending. Pick the longest stage. Everything else is noise until you understand that one stage.&lt;/P&gt;
&lt;P&gt;Inside the stage, you need three numbers before drawing any conclusions:&lt;/P&gt;
&lt;UL style="padding-left: 24px; margin: 0 0 16px;"&gt;
&lt;LI style="margin-bottom: 8px;"&gt;&lt;STRONG&gt;Median task duration&lt;/STRONG&gt;&lt;/LI&gt;
&lt;LI style="margin-bottom: 8px;"&gt;&lt;STRONG&gt;Max task duration&lt;/STRONG&gt;&lt;/LI&gt;
&lt;LI style="margin-bottom: 8px;"&gt;&lt;STRONG&gt;Median vs Max shuffle read size per task&lt;/STRONG&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;The ratio of Max to Median is your first discriminator. From there, three questions applied in order will identify the primary bottleneck in the large majority of cases.&lt;/P&gt;
&lt;DIV style="background: #F0F4F6; border-left: 4px solid #1B3139; padding: 16px 20px; margin: 20px 0; border-radius: 4px; font-size: 15px;"&gt;&lt;STRONG&gt;Question 1:&lt;/STRONG&gt; Is Max Duration more than 5x Median? If yes, check whether shuffle read bytes are also skewed. Both conditions together indicate data skew.&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;Question 2:&lt;/STRONG&gt; Does spill appear on most tasks (not just outliers)? Is GC time above 10% in the Executors tab? If yes, it is memory pressure.&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;Question 3:&lt;/STRONG&gt; Is task count well below 2x your executor core count? If yes, it is underparallelism.&lt;/DIV&gt;
&lt;P&gt;If none of those fit, open the &lt;STRONG&gt;SQL / DataFrame&lt;/STRONG&gt; tab and look at the physical plan. Missing predicate pushdown, unexpected cross joins, or a sort-merge join where broadcast would work are the next places to look.&lt;/P&gt;
&lt;HR /&gt;
&lt;H2 id="skew" style="font-size: 26px; font-weight: bold; color: #1b3139; margin: 28px 0 16px; padding-bottom: 8px; border-bottom: 3px solid #FF3621; display: inline-block;"&gt;Scenario 1: Data skew&lt;/H2&gt;
&lt;P&gt;The lab job joins a 20 million-row fact table where 70% of rows share the same join key (key=1) against a 20-row reference table. Broadcast is explicitly disabled to force a sort-merge join, which means all 20 million rows must be shuffled and sorted by join key before the merge. The partition holding key=1 receives 14 million rows. Every other partition receives a handful.&lt;/P&gt;
&lt;PRE style="background: #1B3139; border-left: 4px solid #FF3621; border-radius: 6px; padding: 20px 24px; margin: 20px 0; font-family: 'SF Mono', Monaco, Consolas, 'Courier New', monospace; font-size: 13.5px; line-height: 1.6; color: #e8ecf0; overflow-x: auto;"&gt;&lt;SPAN&gt;from&lt;/SPAN&gt; pyspark.sql &lt;SPAN&gt;import&lt;/SPAN&gt; functions &lt;SPAN&gt;as&lt;/SPAN&gt; F

spark.conf.set(&lt;SPAN&gt;"spark.sql.adaptive.enabled"&lt;/SPAN&gt;, &lt;SPAN&gt;"false"&lt;/SPAN&gt;)
spark.conf.set(&lt;SPAN&gt;"spark.sql.shuffle.partitions"&lt;/SPAN&gt;, &lt;SPAN&gt;"200"&lt;/SPAN&gt;)
spark.conf.set(&lt;SPAN&gt;"spark.sql.autoBroadcastJoinThreshold"&lt;/SPAN&gt;, &lt;SPAN&gt;"-1"&lt;/SPAN&gt;)  &lt;SPAN&gt;# force sort-merge join&lt;/SPAN&gt;

skew_data = (
    spark.range(&lt;SPAN&gt;20_000_000&lt;/SPAN&gt;)
    .withColumn(&lt;SPAN&gt;"join_key"&lt;/SPAN&gt;,
        F.when(F.rand() &amp;lt; &lt;SPAN&gt;0.70&lt;/SPAN&gt;, F.lit(&lt;SPAN&gt;1&lt;/SPAN&gt;))
         .when(F.rand() &amp;lt; &lt;SPAN&gt;0.85&lt;/SPAN&gt;, F.lit(&lt;SPAN&gt;2&lt;/SPAN&gt;))
         .otherwise((F.rand() * &lt;SPAN&gt;18&lt;/SPAN&gt; + &lt;SPAN&gt;3&lt;/SPAN&gt;).cast(&lt;SPAN&gt;"int"&lt;/SPAN&gt;)))
    .withColumn(&lt;SPAN&gt;"value"&lt;/SPAN&gt;, F.rand() * &lt;SPAN&gt;1000&lt;/SPAN&gt;)
    .withColumn(&lt;SPAN&gt;"payload"&lt;/SPAN&gt;, F.expr(&lt;SPAN&gt;"repeat(cast(rand() as string), 50)"&lt;/SPAN&gt;))
)

ref_data = (
    spark.range(&lt;SPAN&gt;1&lt;/SPAN&gt;, &lt;SPAN&gt;21&lt;/SPAN&gt;)
    .withColumnRenamed(&lt;SPAN&gt;"id"&lt;/SPAN&gt;, &lt;SPAN&gt;"join_key"&lt;/SPAN&gt;)
    .withColumn(&lt;SPAN&gt;"label"&lt;/SPAN&gt;, F.concat(F.lit(&lt;SPAN&gt;"cat_"&lt;/SPAN&gt;), F.col(&lt;SPAN&gt;"join_key"&lt;/SPAN&gt;).cast(&lt;SPAN&gt;"string"&lt;/SPAN&gt;)))
)

result = (
    skew_data.join(ref_data, &lt;SPAN&gt;"join_key"&lt;/SPAN&gt;, &lt;SPAN&gt;"inner"&lt;/SPAN&gt;)
    .groupBy(&lt;SPAN&gt;"join_key"&lt;/SPAN&gt;, &lt;SPAN&gt;"label"&lt;/SPAN&gt;)
    .agg(F.sum(&lt;SPAN&gt;"value"&lt;/SPAN&gt;).alias(&lt;SPAN&gt;"total"&lt;/SPAN&gt;), F.count(&lt;SPAN&gt;"*"&lt;/SPAN&gt;).alias(&lt;SPAN&gt;"cnt"&lt;/SPAN&gt;))
)
result.write.format(&lt;SPAN&gt;"noop"&lt;/SPAN&gt;).mode(&lt;SPAN&gt;"overwrite"&lt;/SPAN&gt;).save()&lt;/PRE&gt;
&lt;H3 style="font-size: 19px; font-weight: bold; color: #1b3139; margin: 24px 0 12px;"&gt;Task Metrics: the 24x ratio&lt;/H3&gt;
&lt;FIGURE style="margin: 16px 0;"&gt;
&lt;FIGCAPTION style="font-size: 13px; color: #6b8a97; margin-top: 8px; font-style: italic;"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Ashwin_DSA_0-1782413970227.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28285i68596E863E4E1DB4/image-size/large?v=v2&amp;amp;px=999" role="button" title="Ashwin_DSA_0-1782413970227.png" alt="Ashwin_DSA_0-1782413970227.png" /&gt;&lt;/span&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
Stage 2 Task Metrics. Median duration 38ms, Max 0.9s: a 24x ratio. Shuffle Read is 0 bytes on 199 of 200 partitions; one partition holds the hot key.&lt;/FIGCAPTION&gt;
&lt;/FIGURE&gt;
&lt;P&gt;The Task Metrics summary table is the most important screen in the Spark UI for diagnosing skew. Two rows matter here:&lt;/P&gt;
&lt;UL style="padding-left: 24px; margin: 0 0 16px;"&gt;
&lt;LI style="margin-bottom: 8px;"&gt;&lt;STRONG&gt;Duration&lt;/STRONG&gt;: Median 38ms, Max 0.9s. A 24x ratio. The 75th percentile sits at 53ms, meaning the outlier is not just slightly above average. It is categorically different from the rest of the distribution.&lt;/LI&gt;
&lt;LI style="margin-bottom: 8px;"&gt;&lt;STRONG&gt;Shuffle Read Size&lt;/STRONG&gt;: Median 0 bytes / 0 records across 199 of 200 tasks. The Max partition receives all the data. Most tasks have nothing to process.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;This is the defining skew signature: a bimodal distribution where the vast majority of tasks finish in milliseconds and one task runs for orders of magnitude longer.&lt;/P&gt;
&lt;H3 style="font-size: 19px; font-weight: bold; color: #1b3139; margin: 24px 0 12px;"&gt;Event Timeline: the visual tell&lt;/H3&gt;
&lt;FIGURE style="margin: 16px 0;"&gt;
&lt;FIGCAPTION style="font-size: 13px; color: #6b8a97; margin-top: 8px; font-style: italic;"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Ashwin_DSA_1-1782413989929.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28286i0C9CAD53CF88B402/image-size/large?v=v2&amp;amp;px=999" role="button" title="Ashwin_DSA_1-1782413989929.png" alt="Ashwin_DSA_1-1782413989929.png" /&gt;&lt;/span&gt;&lt;BR /&gt;Event Timeline for Stage 2. A handful of long green (Executor Computing Time) bars at the top, followed by 190+ short bars. This bimodal shape is the skew signature.&lt;/FIGCAPTION&gt;
&lt;/FIGURE&gt;
&lt;P&gt;The Event Timeline converts the numeric ratio into something immediately visual. Long green bars indicate CPU time spent processing the hot key partition. The colour here matters: green is Executor Computing Time, confirming the slow tasks are CPU-bound on data processing, not waiting on I/O or network. Memory pressure and underparallelism do not produce this bimodal shape, which makes it the fastest visual discriminator between the three bottleneck types.&lt;/P&gt;
&lt;H3 style="font-size: 19px; font-weight: bold; color: #1b3139; margin: 24px 0 12px;"&gt;DAG Visualization: confirming the join strategy&lt;/H3&gt;
&lt;FIGURE style="margin: 16px 0;"&gt;
&lt;FIGCAPTION style="font-size: 13px; color: #6b8a97; margin-top: 8px; font-style: italic;"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Ashwin_DSA_2-1782414031747.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28287iBAA1D228A920BF64/image-size/large?v=v2&amp;amp;px=999" role="button" title="Ashwin_DSA_2-1782414031747.png" alt="Ashwin_DSA_2-1782414031747.png" /&gt;&lt;/span&gt;DAG for Stage 2. Two Exchange (shuffle) nodes feed into two Sort operations, which merge via SortMergeJoin inside WholeStageCodegen. Both sides were fully shuffled and sorted by join key, making skew on the join key directly visible as a task outlier.&lt;/FIGCAPTION&gt;
&lt;/FIGURE&gt;
&lt;P&gt;The DAG confirms the mechanism. Two Exchange nodes (one per join side) followed by Sort operations and a SortMergeJoin. This is the join strategy that makes skew dangerous: every row for key=1 lands on the same partition, and one task must process all of it alone.&lt;/P&gt;
&lt;HR /&gt;
&lt;H2 id="skew-fix" style="font-size: 26px; font-weight: bold; color: #1b3139; margin: 28px 0 16px; padding-bottom: 8px; border-bottom: 3px solid #FF3621; display: inline-block;"&gt;Scenario 1b: Broadcast join fix&lt;/H2&gt;
&lt;P&gt;The reference table has 20 rows. It fits comfortably in executor memory. Enabling broadcast replicates it to every executor, eliminating the need to shuffle the join key entirely. The SortMergeJoin and its two Exchange stages disappear from the plan.&lt;/P&gt;
&lt;PRE style="background: #1B3139; border-left: 4px solid #FF3621; border-radius: 6px; padding: 20px 24px; margin: 20px 0; font-family: 'SF Mono', Monaco, Consolas, 'Courier New', monospace; font-size: 13.5px; line-height: 1.6; color: #e8ecf0; overflow-x: auto;"&gt;&lt;SPAN&gt;# Re-enable broadcast: small table replicated to every executor, no shuffle on join key&lt;/SPAN&gt;
spark.conf.set(&lt;SPAN&gt;"spark.sql.autoBroadcastJoinThreshold"&lt;/SPAN&gt;, str(&lt;SPAN&gt;10&lt;/SPAN&gt; * &lt;SPAN&gt;1024&lt;/SPAN&gt; * &lt;SPAN&gt;1024&lt;/SPAN&gt;))

result = (
    skew_data.join(F.broadcast(ref_data), &lt;SPAN&gt;"join_key"&lt;/SPAN&gt;, &lt;SPAN&gt;"inner"&lt;/SPAN&gt;)
    .groupBy(&lt;SPAN&gt;"join_key"&lt;/SPAN&gt;, &lt;SPAN&gt;"label"&lt;/SPAN&gt;)
    .agg(F.sum(&lt;SPAN&gt;"value"&lt;/SPAN&gt;).alias(&lt;SPAN&gt;"total"&lt;/SPAN&gt;), F.count(&lt;SPAN&gt;"*"&lt;/SPAN&gt;).alias(&lt;SPAN&gt;"cnt"&lt;/SPAN&gt;))
)
result.write.format(&lt;SPAN&gt;"noop"&lt;/SPAN&gt;).mode(&lt;SPAN&gt;"overwrite"&lt;/SPAN&gt;).save()&lt;/PRE&gt;
&lt;H3 style="font-size: 19px; font-weight: bold; color: #1b3139; margin: 24px 0 12px;"&gt;Jobs comparison: the numbers&lt;/H3&gt;
&lt;FIGURE style="margin: 16px 0;"&gt;
&lt;FIGCAPTION style="font-size: 13px; color: #6b8a97; margin-top: 8px; font-style: italic;"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Ashwin_DSA_3-1782414084869.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28288i8DCA710A0D32A3BD/image-size/large?v=v2&amp;amp;px=999" role="button" title="Ashwin_DSA_3-1782414084869.png" alt="Ashwin_DSA_3-1782414084869.png" /&gt;&lt;/span&gt;Jobs tab. Scenario 1 (no broadcast): 18s, 4 stages, 408 tasks. Scenario 1b (broadcast): 5s, 2 stages, 204 tasks. Half the stages, half the tasks, 3.6x faster.&lt;/FIGCAPTION&gt;
&lt;/FIGURE&gt;
&lt;P&gt;The improvement is not incremental. The fix collapses two of the four stages entirely. In production, where a skewed sort-merge join might anchor a 30-minute stage, this is the difference between a job completing before business hours and one that misses its SLA.&lt;/P&gt;
&lt;H3 style="font-size: 19px; font-weight: bold; color: #1b3139; margin: 24px 0 12px;"&gt;Stages after the fix&lt;/H3&gt;
&lt;FIGURE style="margin: 16px 0;"&gt;
&lt;FIGCAPTION style="font-size: 13px; color: #6b8a97; margin-top: 8px; font-style: italic;"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Ashwin_DSA_4-1782414108177.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28289i85F02D8D01FC5C8A/image-size/large?v=v2&amp;amp;px=999" role="button" title="Ashwin_DSA_4-1782414108177.png" alt="Ashwin_DSA_4-1782414108177.png" /&gt;&lt;/span&gt;Scenario 1b Stages. Two stages: a 4-task scan and a 200-task aggregation shuffle. The 200-task shuffle join stage that produced the 24x task ratio is absent.&lt;/FIGCAPTION&gt;
&lt;/FIGURE&gt;
&lt;P&gt;The fix did not make the slow task faster. It eliminated the stage that contained the slow task. That is a fundamentally different kind of improvement and it is the one to aim for with skew. If the data allows it, removing the shuffle is better than optimizing within it.&lt;/P&gt;
&lt;DIV style="background: #F0F4F6; border-left: 4px solid #FF3621; padding: 16px 20px; margin: 20px 0; border-radius: 4px; font-size: 15px;"&gt;&lt;STRONG&gt;When broadcast is not an option&lt;/STRONG&gt;: if the smaller side of the join is too large to broadcast (typically above 500MB to 1GB depending on executor memory), the next tool is salting. Add a random suffix to the hot key before the join to distribute its rows across multiple partitions, then strip the suffix in a second pass. AQE's skew join optimization (&lt;CODE style="font-family: 'SF Mono', Monaco, Consolas, 'Courier New', monospace; background: #E8ECF0; padding: 2px 5px; border-radius: 3px; font-size: 13.5px;"&gt;spark.sql.adaptive.skewJoin.enabled&lt;/CODE&gt;) automates a version of this when the skewed partition exceeds the configured threshold.&lt;/DIV&gt;
&lt;HR /&gt;
&lt;H2 id="memory" style="font-size: 26px; font-weight: bold; color: #1b3139; margin: 28px 0 16px; padding-bottom: 8px; border-bottom: 3px solid #FF3621; display: inline-block;"&gt;Scenario 2: Memory pressure&lt;/H2&gt;
&lt;P&gt;The lab job processes 4 million wide rows, each approximately 400 bytes of string data across five columns, shuffled into only 20 partitions. Each task receives roughly 8 MB of data. A &lt;CODE style="font-family: 'SF Mono', Monaco, Consolas, 'Courier New', monospace; background: #F0F4F6; padding: 2px 6px; border-radius: 4px; font-size: 14.5px; color: #c03020;"&gt;collect_list&lt;/CODE&gt; aggregation forces each task to hold the full list of strings in heap before writing the result. The combination of large per-task data and an in-memory accumulator produces GC pressure across all tasks.&lt;/P&gt;
&lt;PRE style="background: #1B3139; border-left: 4px solid #FF3621; border-radius: 6px; padding: 20px 24px; margin: 20px 0; font-family: 'SF Mono', Monaco, Consolas, 'Courier New', monospace; font-size: 13.5px; line-height: 1.6; color: #e8ecf0; overflow-x: auto;"&gt;spark.conf.set(&lt;SPAN&gt;"spark.sql.shuffle.partitions"&lt;/SPAN&gt;, &lt;SPAN&gt;"20"&lt;/SPAN&gt;)  &lt;SPAN&gt;# intentionally too few&lt;/SPAN&gt;

wide_data = (
    spark.range(&lt;SPAN&gt;4_000_000&lt;/SPAN&gt;)
    .withColumn(&lt;SPAN&gt;"group_key"&lt;/SPAN&gt;, (F.col(&lt;SPAN&gt;"id"&lt;/SPAN&gt;) % &lt;SPAN&gt;20&lt;/SPAN&gt;).cast(&lt;SPAN&gt;"int"&lt;/SPAN&gt;))
    .withColumn(&lt;SPAN&gt;"col_a"&lt;/SPAN&gt;, F.expr(&lt;SPAN&gt;"repeat(cast(rand() as string), 80)"&lt;/SPAN&gt;))
    .withColumn(&lt;SPAN&gt;"col_b"&lt;/SPAN&gt;, F.expr(&lt;SPAN&gt;"repeat(cast(rand() as string), 80)"&lt;/SPAN&gt;))
    .withColumn(&lt;SPAN&gt;"col_c"&lt;/SPAN&gt;, F.expr(&lt;SPAN&gt;"repeat(cast(rand() as string), 80)"&lt;/SPAN&gt;))
    .withColumn(&lt;SPAN&gt;"col_d"&lt;/SPAN&gt;, F.expr(&lt;SPAN&gt;"repeat(cast(rand() as string), 80)"&lt;/SPAN&gt;))
    .withColumn(&lt;SPAN&gt;"col_e"&lt;/SPAN&gt;, F.expr(&lt;SPAN&gt;"repeat(cast(rand() as string), 80)"&lt;/SPAN&gt;))
    .withColumn(&lt;SPAN&gt;"metric"&lt;/SPAN&gt;, F.rand() * &lt;SPAN&gt;1000&lt;/SPAN&gt;)
)

result = (
    wide_data
    .repartition(&lt;SPAN&gt;20&lt;/SPAN&gt;, &lt;SPAN&gt;"group_key"&lt;/SPAN&gt;)
    .groupBy(&lt;SPAN&gt;"group_key"&lt;/SPAN&gt;)
    .agg(
        F.sum(&lt;SPAN&gt;"metric"&lt;/SPAN&gt;).alias(&lt;SPAN&gt;"total"&lt;/SPAN&gt;),
        F.collect_list(&lt;SPAN&gt;"col_a"&lt;/SPAN&gt;).alias(&lt;SPAN&gt;"all_a"&lt;/SPAN&gt;),  &lt;SPAN&gt;# forces large in-memory buffer&lt;/SPAN&gt;
        F.count(&lt;SPAN&gt;"*"&lt;/SPAN&gt;).alias(&lt;SPAN&gt;"cnt"&lt;/SPAN&gt;)
    )
)
result.write.format(&lt;SPAN&gt;"noop"&lt;/SPAN&gt;).mode(&lt;SPAN&gt;"overwrite"&lt;/SPAN&gt;).save()&lt;/PRE&gt;
&lt;H3 style="font-size: 19px; font-weight: bold; color: #1b3139; margin: 24px 0 12px;"&gt;Stages: large shuffle volume, few tasks&lt;/H3&gt;
&lt;FIGURE style="margin: 16px 0;"&gt;
&lt;FIGCAPTION style="font-size: 13px; color: #6b8a97; margin-top: 8px; font-style: italic;"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Ashwin_DSA_5-1782414148744.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28290iF88D60C7753820FB/image-size/large?v=v2&amp;amp;px=999" role="button" title="Ashwin_DSA_5-1782414148744.png" alt="Ashwin_DSA_5-1782414148744.png" /&gt;&lt;/span&gt;Scenario 2 Stages. Stage 8: 20 tasks, 173.5 MiB shuffle read, 12 seconds. Compare to Scenario 1's Stage 2: 200 tasks, 7 KiB shuffle read, 9 seconds. The absolute shuffle volume flags the problem before you even click in.&lt;/FIGCAPTION&gt;
&lt;/FIGURE&gt;
&lt;P&gt;The 173.5 MiB total shuffle read across 20 tasks means each task processes approximately 8.7 MiB of serialized data, before accounting for the deserialized in-memory representation which is larger. This is where the memory constraint originates.&lt;/P&gt;
&lt;H3 style="font-size: 19px; font-weight: bold; color: #1b3139; margin: 24px 0 12px;"&gt;Task Metrics: GC time is the signal&lt;/H3&gt;
&lt;FIGURE style="margin: 16px 0;"&gt;
&lt;FIGCAPTION style="font-size: 13px; color: #6b8a97; margin-top: 8px; font-style: italic;"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Ashwin_DSA_6-1782414171745.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28291i70EBCBFA8FEE4F77/image-size/large?v=v2&amp;amp;px=999" role="button" title="Ashwin_DSA_6-1782414171745.png" alt="Ashwin_DSA_6-1782414171745.png" /&gt;&lt;/span&gt;Stage 8 Task Metrics. Duration: Median 2s, Max 9s (4.5x ratio, much tighter than skew's 24x). GC Time: Max 4s on a 9s task, 44% of task time in garbage collection. Shuffle Read Median is 8.3 MiB / 200k records per task, confirming large uniform per-task data volumes.&lt;/FIGCAPTION&gt;
&lt;/FIGURE&gt;
&lt;P&gt;Two things separate this from skew:&lt;/P&gt;
&lt;UL style="padding-left: 24px; margin: 0 0 16px;"&gt;
&lt;LI style="margin-bottom: 8px;"&gt;&lt;STRONG&gt;Duration ratio&lt;/STRONG&gt;: Max/Median is 4.5x, compared to 24x in Scenario 1. All the heavy tasks are slow together, not one outlier dragging the stage.&lt;/LI&gt;
&lt;LI style="margin-bottom: 8px;"&gt;&lt;STRONG&gt;GC Time&lt;/STRONG&gt;: Max 4 seconds on a 9-second task. That is 44% of task time in garbage collection. The JVM is constantly reclaiming the large string arrays built by &lt;CODE style="font-family: 'SF Mono', Monaco, Consolas, 'Courier New', monospace; background: #F0F4F6; padding: 2px 6px; border-radius: 4px; font-size: 14.5px; color: #c03020;"&gt;collect_list&lt;/CODE&gt;. This is the clearest memory pressure indicator available in the Spark UI.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Note there is no Spill row in this screenshot. On this cluster with 10.7 GiB executor memory, the data stays in heap but causes severe GC pressure. In production with smaller executor memory relative to partition size, the same root cause produces disk spill instead. The fix is identical in both cases: reduce per-task data volume.&lt;/P&gt;
&lt;DIV style="background: #F0F4F6; border-left: 4px solid #1B3139; padding: 16px 20px; margin: 20px 0; border-radius: 4px; font-size: 15px;"&gt;&lt;STRONG&gt;Executors tab note&lt;/STRONG&gt;: on a multi-executor cluster, check the Executors tab. The GC Time column shows cumulative GC per executor. Uniformly high GC across all executors points to a partition sizing problem. GC concentrated on specific executors points to uneven data distribution.&lt;/DIV&gt;
&lt;HR /&gt;
&lt;H2 id="memory-fix" style="font-size: 26px; font-weight: bold; color: #1b3139; margin: 28px 0 16px; padding-bottom: 8px; border-bottom: 3px solid #FF3621; display: inline-block;"&gt;Scenario 2b: Partition fix&lt;/H2&gt;
&lt;P&gt;The fix raises shuffle partitions from 20 to 400 and removes the &lt;CODE style="font-family: 'SF Mono', Monaco, Consolas, 'Courier New', monospace; background: #F0F4F6; padding: 2px 6px; border-radius: 4px; font-size: 14.5px; color: #c03020;"&gt;collect_list&lt;/CODE&gt; aggregation. Each task now receives a much smaller data chunk, and no large in-memory buffers are built.&lt;/P&gt;
&lt;PRE style="background: #1B3139; border-left: 4px solid #FF3621; border-radius: 6px; padding: 20px 24px; margin: 20px 0; font-family: 'SF Mono', Monaco, Consolas, 'Courier New', monospace; font-size: 13.5px; line-height: 1.6; color: #e8ecf0; overflow-x: auto;"&gt;spark.conf.set(&lt;SPAN&gt;"spark.sql.shuffle.partitions"&lt;/SPAN&gt;, &lt;SPAN&gt;"400"&lt;/SPAN&gt;)

result = (
    wide_data
    .repartition(&lt;SPAN&gt;400&lt;/SPAN&gt;, &lt;SPAN&gt;"group_key"&lt;/SPAN&gt;)
    .groupBy(&lt;SPAN&gt;"group_key"&lt;/SPAN&gt;)
    .agg(F.sum(&lt;SPAN&gt;"metric"&lt;/SPAN&gt;).alias(&lt;SPAN&gt;"total"&lt;/SPAN&gt;), F.count(&lt;SPAN&gt;"*"&lt;/SPAN&gt;).alias(&lt;SPAN&gt;"cnt"&lt;/SPAN&gt;))
    &lt;SPAN&gt;# collect_list removed: no large in-memory accumulator&lt;/SPAN&gt;
)
result.write.format(&lt;SPAN&gt;"noop"&lt;/SPAN&gt;).mode(&lt;SPAN&gt;"overwrite"&lt;/SPAN&gt;).save()&lt;/PRE&gt;
&lt;FIGURE style="margin: 16px 0;"&gt;
&lt;FIGCAPTION style="font-size: 13px; color: #6b8a97; margin-top: 8px; font-style: italic;"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Ashwin_DSA_11-1782414614629.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28296i7F3A49006216025E/image-size/large?v=v2&amp;amp;px=999" role="button" title="Ashwin_DSA_11-1782414614629.png" alt="Ashwin_DSA_11-1782414614629.png" /&gt;&lt;/span&gt;Scenario 2b Stages. Stage 10: 400 tasks, 39.0 MiB shuffle read (down from 173.5 MiB), 6 seconds (down from 12 seconds).&lt;/FIGCAPTION&gt;
&lt;/FIGURE&gt;
&lt;FIGURE style="margin: 16px 0;"&gt;
&lt;FIGCAPTION style="font-size: 13px; color: #6b8a97; margin-top: 8px; font-style: italic;"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Ashwin_DSA_12-1782414635602.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28297i7F369B63913346EC/image-size/large?v=v2&amp;amp;px=999" role="button" title="Ashwin_DSA_12-1782414635602.png" alt="Ashwin_DSA_12-1782414635602.png" /&gt;&lt;/span&gt;Scenario 2b Task Metrics. GC Time: 0ms across all percentiles including Max. Duration: Median 10ms, Max 0.2s. The root cause is eliminated, not just reduced.&lt;/FIGCAPTION&gt;
&lt;/FIGURE&gt;
&lt;TABLE style="width: 100%; border-collapse: collapse; margin: 20px 0; font-size: 15px;"&gt;
&lt;THEAD&gt;
&lt;TR style="border-bottom: 2px solid #1B3139;"&gt;
&lt;TH style="text-align: left; padding: 10px 12px; color: #1b3139;"&gt;Metric&lt;/TH&gt;
&lt;TH style="text-align: right; padding: 10px 12px; color: #ff3621;"&gt;20 partitions&lt;/TH&gt;
&lt;TH style="text-align: right; padding: 10px 12px; color: #1b3139;"&gt;400 partitions&lt;/TH&gt;
&lt;/TR&gt;
&lt;/THEAD&gt;
&lt;TBODY&gt;
&lt;TR style="border-bottom: 1px solid #E8ECF0;"&gt;
&lt;TD style="padding: 10px 12px;"&gt;Stage duration&lt;/TD&gt;
&lt;TD style="text-align: right; padding: 10px 12px; color: #ff3621;"&gt;12s&lt;/TD&gt;
&lt;TD style="text-align: right; padding: 10px 12px;"&gt;6s&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR style="border-bottom: 1px solid #E8ECF0; background: #FAFBFC;"&gt;
&lt;TD style="padding: 10px 12px;"&gt;Median task duration&lt;/TD&gt;
&lt;TD style="text-align: right; padding: 10px 12px; color: #ff3621;"&gt;2s&lt;/TD&gt;
&lt;TD style="text-align: right; padding: 10px 12px;"&gt;10ms&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR style="border-bottom: 1px solid #E8ECF0;"&gt;
&lt;TD style="padding: 10px 12px;"&gt;Max task duration&lt;/TD&gt;
&lt;TD style="text-align: right; padding: 10px 12px; color: #ff3621;"&gt;9s&lt;/TD&gt;
&lt;TD style="text-align: right; padding: 10px 12px;"&gt;0.2s&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR style="border-bottom: 1px solid #E8ECF0; background: #FAFBFC;"&gt;
&lt;TD style="padding: 10px 12px;"&gt;Max GC Time&lt;/TD&gt;
&lt;TD style="text-align: right; padding: 10px 12px; color: #ff3621;"&gt;4s (44% of task)&lt;/TD&gt;
&lt;TD style="text-align: right; padding: 10px 12px;"&gt;0ms&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD style="padding: 10px 12px;"&gt;Total shuffle read&lt;/TD&gt;
&lt;TD style="text-align: right; padding: 10px 12px; color: #ff3621;"&gt;173.5 MiB&lt;/TD&gt;
&lt;TD style="text-align: right; padding: 10px 12px;"&gt;39.0 MiB&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;P&gt;GC time dropping to zero is the validation signal. Not the wall-clock improvement (which is real), but the fact that the JVM has nothing to collect. The objects that were triggering pressure no longer exist at that size in heap.&lt;/P&gt;
&lt;HR /&gt;
&lt;H2 id="parallelism" style="font-size: 26px; font-weight: bold; color: #1b3139; margin: 28px 0 16px; padding-bottom: 8px; border-bottom: 3px solid #FF3621; display: inline-block;"&gt;Scenario 3: Underparallelism&lt;/H2&gt;
&lt;P&gt;The lab job runs a 3 million-row aggregation after repartitioning to 4 partitions. There is no skew and no spill. The cluster is simply not given enough tasks to use its available cores.&lt;/P&gt;
&lt;PRE style="background: #1B3139; border-left: 4px solid #FF3621; border-radius: 6px; padding: 20px 24px; margin: 20px 0; font-family: 'SF Mono', Monaco, Consolas, 'Courier New', monospace; font-size: 13.5px; line-height: 1.6; color: #e8ecf0; overflow-x: auto;"&gt;spark.conf.set(&lt;SPAN&gt;"spark.sql.shuffle.partitions"&lt;/SPAN&gt;, &lt;SPAN&gt;"4"&lt;/SPAN&gt;)  &lt;SPAN&gt;# intentionally too low&lt;/SPAN&gt;

under_data = (
    spark.range(&lt;SPAN&gt;3_000_000&lt;/SPAN&gt;)
    .withColumn(&lt;SPAN&gt;"category"&lt;/SPAN&gt;, (F.rand() * &lt;SPAN&gt;100&lt;/SPAN&gt;).cast(&lt;SPAN&gt;"int"&lt;/SPAN&gt;).cast(&lt;SPAN&gt;"string"&lt;/SPAN&gt;))
    .withColumn(&lt;SPAN&gt;"value"&lt;/SPAN&gt;, F.rand() * &lt;SPAN&gt;500&lt;/SPAN&gt;)
)

result = (
    under_data
    .repartition(&lt;SPAN&gt;4&lt;/SPAN&gt;)  &lt;SPAN&gt;# 4 tasks regardless of cluster size&lt;/SPAN&gt;
    .groupBy(&lt;SPAN&gt;"category"&lt;/SPAN&gt;)
    .agg(F.avg(&lt;SPAN&gt;"value"&lt;/SPAN&gt;).alias(&lt;SPAN&gt;"avg_val"&lt;/SPAN&gt;), F.count(&lt;SPAN&gt;"*"&lt;/SPAN&gt;).alias(&lt;SPAN&gt;"cnt"&lt;/SPAN&gt;))
    .orderBy(&lt;SPAN&gt;"cnt"&lt;/SPAN&gt;, ascending=&lt;SPAN&gt;False&lt;/SPAN&gt;)
)
result.write.format(&lt;SPAN&gt;"noop"&lt;/SPAN&gt;).mode(&lt;SPAN&gt;"overwrite"&lt;/SPAN&gt;).save()

&lt;SPAN&gt;# Fix: raise partitions to match data volume&lt;/SPAN&gt;
spark.conf.set(&lt;SPAN&gt;"spark.sql.shuffle.partitions"&lt;/SPAN&gt;, &lt;SPAN&gt;"100"&lt;/SPAN&gt;)
result = (
    under_data.repartition(&lt;SPAN&gt;100&lt;/SPAN&gt;)
    .groupBy(&lt;SPAN&gt;"category"&lt;/SPAN&gt;)
    .agg(F.avg(&lt;SPAN&gt;"value"&lt;/SPAN&gt;).alias(&lt;SPAN&gt;"avg_val"&lt;/SPAN&gt;), F.count(&lt;SPAN&gt;"*"&lt;/SPAN&gt;).alias(&lt;SPAN&gt;"cnt"&lt;/SPAN&gt;))
    .orderBy(&lt;SPAN&gt;"cnt"&lt;/SPAN&gt;, ascending=&lt;SPAN&gt;False&lt;/SPAN&gt;)
)
result.write.format(&lt;SPAN&gt;"noop"&lt;/SPAN&gt;).mode(&lt;SPAN&gt;"overwrite"&lt;/SPAN&gt;).save()&lt;/PRE&gt;
&lt;FIGURE style="margin: 16px 0;"&gt;
&lt;FIGCAPTION style="font-size: 13px; color: #6b8a97; margin-top: 8px; font-style: italic;"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Ashwin_DSA_13-1782414658971.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28298i0430183E617B73C0/image-size/large?v=v2&amp;amp;px=999" role="button" title="Ashwin_DSA_13-1782414658971.png" alt="Ashwin_DSA_13-1782414658971.png" /&gt;&lt;/span&gt;Scenario 3 Stages. Three stages, all with 4/4 tasks. On a production cluster with 32 or 64 cores, the same configuration leaves the vast majority of the cluster idle throughout the job.&lt;/FIGCAPTION&gt;
&lt;/FIGURE&gt;
&lt;P&gt;The underparallelism tell in the Stages tab is the task count. If every stage runs a small, fixed number of tasks regardless of data volume, the shuffle partition count is almost certainly the constraint. Check &lt;CODE style="font-family: 'SF Mono', Monaco, Consolas, 'Courier New', monospace; background: #F0F4F6; padding: 2px 6px; border-radius: 4px; font-size: 14.5px; color: #c03020;"&gt;spark.sql.shuffle.partitions&lt;/CODE&gt; and compare it to 2x your executor core count as a starting floor.&lt;/P&gt;
&lt;P&gt;Unlike skew and memory pressure, underparallelism produces no spill, no GC pressure, and a tight Max/Median ratio. All tasks run cleanly. The job simply uses a fraction of available parallelism, so it takes proportionally longer than it needs to.&lt;/P&gt;
&lt;FIGURE style="margin: 16px 0;"&gt;
&lt;FIGCAPTION style="font-size: 13px; color: #6b8a97; margin-top: 8px; font-style: italic;"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Ashwin_DSA_14-1782414675999.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28299iF1DD90C43FC681A0/image-size/large?v=v2&amp;amp;px=999" role="button" title="Ashwin_DSA_14-1782414675999.png" alt="Ashwin_DSA_14-1782414675999.png" /&gt;&lt;/span&gt;Scenario 3b Stages. 100 tasks in the aggregation stage, 94 in the sort stage. On a production cluster with 32 cores, the 4-task version wastes 28 cores per wave. The 100-task version keeps the cluster busy and completes in roughly 1/8 the wall-clock time.&lt;/FIGCAPTION&gt;
&lt;/FIGURE&gt;
&lt;HR /&gt;
&lt;H2 id="decision" style="font-size: 26px; font-weight: bold; color: #1b3139; margin: 28px 0 16px; padding-bottom: 8px; border-bottom: 3px solid #FF3621; display: inline-block;"&gt;Decision map and practical thresholds&lt;/H2&gt;
&lt;TABLE style="width: 100%; border-collapse: collapse; margin: 16px 0; font-size: 14px;"&gt;
&lt;THEAD&gt;
&lt;TR style="background: #1B3139; color: #ffffff;"&gt;
&lt;TH style="text-align: left; padding: 10px 12px;"&gt;Symptom&lt;/TH&gt;
&lt;TH style="text-align: left; padding: 10px 12px;"&gt;Root cause&lt;/TH&gt;
&lt;TH style="text-align: left; padding: 10px 12px;"&gt;First action&lt;/TH&gt;
&lt;/TR&gt;
&lt;/THEAD&gt;
&lt;TBODY&gt;
&lt;TR style="border-bottom: 1px solid #E8ECF0;"&gt;
&lt;TD style="padding: 10px 12px;"&gt;Max/Median &amp;gt; 5x, shuffle read skewed&lt;/TD&gt;
&lt;TD style="padding: 10px 12px;"&gt;Data skew&lt;/TD&gt;
&lt;TD style="padding: 10px 12px;"&gt;Broadcast join if small side fits; salting if not&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR style="border-bottom: 1px solid #E8ECF0; background: #FAFBFC;"&gt;
&lt;TD style="padding: 10px 12px;"&gt;Spill on most tasks or GC &amp;gt; 10%&lt;/TD&gt;
&lt;TD style="padding: 10px 12px;"&gt;Memory pressure&lt;/TD&gt;
&lt;TD style="padding: 10px 12px;"&gt;More partitions before adding executor memory&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR style="border-bottom: 1px solid #E8ECF0;"&gt;
&lt;TD style="padding: 10px 12px;"&gt;Task count &amp;lt; 2x executor cores&lt;/TD&gt;
&lt;TD style="padding: 10px 12px;"&gt;Underparallelism&lt;/TD&gt;
&lt;TD style="padding: 10px 12px;"&gt;Raise &lt;CODE style="font-family: 'SF Mono', Monaco, Consolas, 'Courier New', monospace; background: #F0F4F6; padding: 2px 5px; border-radius: 3px; font-size: 13px;"&gt;spark.sql.shuffle.partitions&lt;/CODE&gt; or add &lt;CODE style="font-family: 'SF Mono', Monaco, Consolas, 'Courier New', monospace; background: #F0F4F6; padding: 2px 5px; border-radius: 3px; font-size: 13px;"&gt;repartition()&lt;/CODE&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR style="background: #FAFBFC;"&gt;
&lt;TD style="padding: 10px 12px;"&gt;None of the above&lt;/TD&gt;
&lt;TD style="padding: 10px 12px;"&gt;Plan problem&lt;/TD&gt;
&lt;TD style="padding: 10px 12px;"&gt;SQL tab: check for cross joins, missing predicate pushdown, wrong join strategy&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;H3 style="font-size: 19px; font-weight: bold; color: #1b3139; margin: 24px 0 12px;"&gt;Thresholds experienced teams use&lt;/H3&gt;
&lt;UL style="padding-left: 24px; margin: 0 0 16px;"&gt;
&lt;LI style="margin-bottom: 8px;"&gt;&lt;STRONG&gt;Skew threshold&lt;/STRONG&gt;: any task reading more than 3x the median shuffle bytes warrants investigation. AQE's skew join optimization fires at 256MB by default. Lowering it to 64MB catches problems earlier: &lt;CODE style="font-family: 'SF Mono', Monaco, Consolas, 'Courier New', monospace; background: #F0F4F6; padding: 2px 6px; border-radius: 4px; font-size: 14.5px; color: #c03020;"&gt;spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes=67108864&lt;/CODE&gt;&lt;/LI&gt;
&lt;LI style="margin-bottom: 8px;"&gt;&lt;STRONG&gt;Spill threshold&lt;/STRONG&gt;: any disk spill is worth addressing. Even 100MB of spill indicates the executor memory-to-partition-size ratio is wrong.&lt;/LI&gt;
&lt;LI style="margin-bottom: 8px;"&gt;&lt;STRONG&gt;GC threshold&lt;/STRONG&gt;: GC time above 10% of task time in the Executors tab means memory is a primary constraint. Above 20%, it is the dominant cause of slowness.&lt;/LI&gt;
&lt;LI style="margin-bottom: 8px;"&gt;&lt;STRONG&gt;Broadcast threshold&lt;/STRONG&gt;: &lt;CODE style="font-family: 'SF Mono', Monaco, Consolas, 'Courier New', monospace; background: #F0F4F6; padding: 2px 6px; border-radius: 4px; font-size: 14.5px; color: #c03020;"&gt;spark.sql.autoBroadcastJoinThreshold&lt;/CODE&gt; defaults to 10MB. In practice, tables up to 500MB to 1GB broadcast safely on modern clusters if executor memory is adequate. Check the physical plan in the SQL tab to confirm which join strategy Spark chose.&lt;/LI&gt;
&lt;LI style="margin-bottom: 8px;"&gt;&lt;STRONG&gt;Partition sizing&lt;/STRONG&gt;: target 128 to 256MB of input data per partition for shuffle-heavy stages. For compute-intensive stages with complex UDFs, target 64MB to avoid GC pressure from large in-memory objects.&lt;/LI&gt;
&lt;/UL&gt;
&lt;DIV style="background: #F0F4F6; border-left: 4px solid #FF3621; padding: 16px 20px; margin: 20px 0; border-radius: 4px; font-size: 15px;"&gt;&lt;STRONG&gt;Enable AQE first&lt;/STRONG&gt;: set &lt;CODE style="font-family: 'SF Mono', Monaco, Consolas, 'Courier New', monospace; background: #E8ECF0; padding: 2px 5px; border-radius: 3px; font-size: 13.5px;"&gt;spark.sql.adaptive.enabled=true&lt;/CODE&gt; before doing any manual tuning. AQE resolves the majority of shuffle partition count and broadcast join problems automatically. Use the investigation sequence above for what AQE does not fix: severe skew on low-cardinality keys, memory pressure from large individual task sizes, and first-stage underparallelism before AQE has seen statistics.&lt;/DIV&gt;
&lt;HR /&gt;
&lt;H2 id="validation" style="font-size: 26px; font-weight: bold; color: #1b3139; margin: 28px 0 16px; padding-bottom: 8px; border-bottom: 3px solid #FF3621; display: inline-block;"&gt;Validating the fix&lt;/H2&gt;
&lt;P&gt;A single faster run is not enough. Here is what to check after applying a fix:&lt;/P&gt;
&lt;OL style="padding-left: 24px; margin: 0 0 16px;"&gt;
&lt;LI style="margin-bottom: 12px;"&gt;&lt;STRONG&gt;The metric moved, not just the clock.&lt;/STRONG&gt; If you salted for skew, confirm the Max/Median ratio dropped below 2x. If you increased partitions for memory pressure, confirm GC time dropped to zero or near zero, not just reduced. Wall-clock improvement with no change in the underlying metric means something else is limiting performance.&lt;/LI&gt;
&lt;LI style="margin-bottom: 12px;"&gt;&lt;STRONG&gt;Spill is zero, not reduced.&lt;/STRONG&gt; Spill dropping from 10GB to 2GB means you improved the memory-to-partition ratio but did not fully resolve it. The correct partition size produces zero spill.&lt;/LI&gt;
&lt;LI style="margin-bottom: 12px;"&gt;&lt;STRONG&gt;Run on a cold cache.&lt;/STRONG&gt; The second run of a job often benefits from cached shuffle files from the first run. Force a fresh run by clearing cache or changing the input path, otherwise the improvement may be partially artificial.&lt;/LI&gt;
&lt;LI style="margin-bottom: 12px;"&gt;&lt;STRONG&gt;Run at full production data volume.&lt;/STRONG&gt; Skew and memory pressure are data-volume-dependent. A fix that works on a 10% sample frequently fails at full volume because the skewed key's frequency is nonlinear and per-partition data volume changes.&lt;/LI&gt;
&lt;LI style="margin-bottom: 12px;"&gt;&lt;STRONG&gt;Check across a time window.&lt;/STRONG&gt; For recurring jobs, pull 7 days of run history after applying the fix and confirm duration variance dropped. A job that averages fast but spikes 3x on weekends still has a skew or partition imbalance problem the average obscures.&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;Hope this is useful for diagnosing your slow stages. If you have additional techniques or thresholds that work well in your environment, please share them in the comments. These patterns improve with more data points.&lt;/P&gt;
&lt;/DIV&gt;</description>
      <pubDate>Thu, 25 Jun 2026 19:11:51 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/reading-spark-ui-a-repeatable-guide-to-finding-performance/m-p/160574#M1320</guid>
      <dc:creator>Ashwin_DSA</dc:creator>
      <dc:date>2026-06-25T19:11:51Z</dc:date>
    </item>
    <item>
      <title>Re: Reading Spark UI: A Repeatable Guide to Finding Performance Bottlenecks</title>
      <link>https://community.databricks.com/t5/community-articles/reading-spark-ui-a-repeatable-guide-to-finding-performance/m-p/160680#M1321</link>
      <description>&lt;P&gt;Thanks for this! Very insightful and detailed. Your sequence for diagnosing slow Spark jobs when symptoms overlap is exactly what I needed. Bookmarked this for my team.&lt;/P&gt;</description>
      <pubDate>Fri, 26 Jun 2026 14:08:20 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/reading-spark-ui-a-repeatable-guide-to-finding-performance/m-p/160680#M1321</guid>
      <dc:creator>srinivasu_nalla</dc:creator>
      <dc:date>2026-06-26T14:08:20Z</dc:date>
    </item>
  </channel>
</rss>

