<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Understanding Coalesce, Skewed Joins, and Why AQE Doesn't Always Intervene in Community Articles</title>
    <link>https://community.databricks.com/t5/community-articles/understanding-coalesce-skewed-joins-and-why-aqe-doesn-t-always/m-p/115586#M405</link>
    <description>&lt;P&gt;In Spark, &lt;STRONG&gt;data skew&lt;/STRONG&gt; can be the silent killer of performance: a single partition holding 90% of the data leaves one task doing most of the work while the rest sit idle.&lt;/P&gt;&lt;P&gt;But even with &lt;STRONG&gt;AQE (Adaptive Query Execution)&lt;/STRONG&gt; turned on in Databricks, &lt;STRONG&gt;skew isn't always identified automatically&lt;/STRONG&gt;, and here’s why.&lt;/P&gt;&lt;H3&gt;What Is coalesce() in Spark?&lt;/H3&gt;&lt;P class=""&gt;The coalesce(n) function reduces the number of partitions in a DataFrame &lt;STRONG&gt;without a full shuffle&lt;/STRONG&gt; by merging existing partitions. It is usually used to compact data after a wide transformation such as a join or groupBy, and it’s especially useful when:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P class=""&gt;You're writing output to disk (e.g., Parquet, Delta) and want fewer files.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P class=""&gt;You're post-processing skewed data and want to consolidate many small partitions into fewer tasks.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;But coalesce(n) only merges existing partitions; it does not rebalance rows across them. As a result, a disproportionately large volume of data can remain concentrated in a single partition, leading to severe data skew, where one task handles the majority of the workload while the others remain underutilized.&lt;/P&gt;&lt;H3&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Data Skew.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/16035iC8544FCBC378E874/image-size/large?v=v2&amp;amp;px=999" role="button" title="Data Skew.png" alt="Data Skew.png" /&gt;&lt;/span&gt;&lt;/H3&gt;&lt;H3&gt;Shouldn’t AQE (Adaptive Query Execution) have caught this?&lt;/H3&gt;&lt;P class=""&gt;The coalesce(n) operation &lt;STRONG&gt;does not trigger a full shuffle the way repartition(n) does&lt;/STRONG&gt;. AQE re-optimizes a query at shuffle boundaries, using the runtime statistics collected when a shuffle stage completes. Because coalesce(n) introduces no shuffle, there is no such boundary and no run-time signal for Catalyst to act on, so the precondition for invoking AQE is never met.&lt;/P&gt;&lt;H3&gt;Conclusion&lt;/H3&gt;&lt;P class=""&gt;AQE didn’t help, not because it failed, but because we never gave it the chance.&lt;/P&gt;</description>
    <pubDate>Tue, 15 Apr 2025 21:15:46 GMT</pubDate>
    <dc:creator>techgeorge</dc:creator>
    <dc:date>2025-04-15T21:15:46Z</dc:date>
    <item>
      <title>Understanding Coalesce, Skewed Joins, and Why AQE Doesn't Always Intervene</title>
      <link>https://community.databricks.com/t5/community-articles/understanding-coalesce-skewed-joins-and-why-aqe-doesn-t-always/m-p/115586#M405</link>
      <description>&lt;P&gt;In Spark, &lt;STRONG&gt;data skew&lt;/STRONG&gt; can be the silent killer of performance: a single partition holding 90% of the data leaves one task doing most of the work while the rest sit idle.&lt;/P&gt;&lt;P&gt;But even with &lt;STRONG&gt;AQE (Adaptive Query Execution)&lt;/STRONG&gt; turned on in Databricks, &lt;STRONG&gt;skew isn't always identified automatically&lt;/STRONG&gt;, and here’s why.&lt;/P&gt;&lt;H3&gt;What Is coalesce() in Spark?&lt;/H3&gt;&lt;P class=""&gt;The coalesce(n) function reduces the number of partitions in a DataFrame &lt;STRONG&gt;without a full shuffle&lt;/STRONG&gt; by merging existing partitions. It is usually used to compact data after a wide transformation such as a join or groupBy, and it’s especially useful when:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P class=""&gt;You're writing output to disk (e.g., Parquet, Delta) and want fewer files.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P class=""&gt;You're post-processing skewed data and want to consolidate many small partitions into fewer tasks.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;But coalesce(n) only merges existing partitions; it does not rebalance rows across them. As a result, a disproportionately large volume of data can remain concentrated in a single partition, leading to severe data skew, where one task handles the majority of the workload while the others remain underutilized.&lt;/P&gt;&lt;H3&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Data Skew.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/16035iC8544FCBC378E874/image-size/large?v=v2&amp;amp;px=999" role="button" title="Data Skew.png" alt="Data Skew.png" /&gt;&lt;/span&gt;&lt;/H3&gt;&lt;H3&gt;Shouldn’t AQE (Adaptive Query Execution) have caught this?&lt;/H3&gt;&lt;P class=""&gt;The coalesce(n) operation &lt;STRONG&gt;does not trigger a full shuffle the way repartition(n) does&lt;/STRONG&gt;. AQE re-optimizes a query at shuffle boundaries, using the runtime statistics collected when a shuffle stage completes. Because coalesce(n) introduces no shuffle, there is no such boundary and no run-time signal for Catalyst to act on, so the precondition for invoking AQE is never met.&lt;/P&gt;&lt;H3&gt;Conclusion&lt;/H3&gt;&lt;P class=""&gt;AQE didn’t help, not because it failed, but because we never gave it the chance.&lt;/P&gt;</description>
      <pubDate>Tue, 15 Apr 2025 21:15:46 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/understanding-coalesce-skewed-joins-and-why-aqe-doesn-t-always/m-p/115586#M405</guid>
      <dc:creator>techgeorge</dc:creator>
      <dc:date>2025-04-15T21:15:46Z</dc:date>
    </item>
    <item>
      <title>Re: Understanding Coalesce, Skewed Joins, and Why AQE Doesn't Always Intervene</title>
      <link>https://community.databricks.com/t5/community-articles/understanding-coalesce-skewed-joins-and-why-aqe-doesn-t-always/m-p/115815#M407</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/82205"&gt;@mark_ott&lt;/a&gt;, this question seems right up your alley. Care to comment?&lt;/P&gt;</description>
      <pubDate>Thu, 17 Apr 2025 23:18:40 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/understanding-coalesce-skewed-joins-and-why-aqe-doesn-t-always/m-p/115815#M407</guid>
      <dc:creator>Louis_Frolio</dc:creator>
      <dc:date>2025-04-17T23:18:40Z</dc:date>
    </item>
  </channel>
</rss>

