<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic DELTA Merge taking too much Time in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/delta-merge-taking-too-much-time/m-p/154001#M54050</link>
    <description>&lt;P&gt;Hi Legends,&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have a time-series Delta table of&amp;nbsp;&lt;SPAN&gt;707.1 GiB, 7,702 files, and 262 billion rows. The table is clustered on 2 columns (a timestamp column and a descriptive column).&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;I have designed a pipeline that runs every week and MERGEs source CDC data into the above Delta table.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;However, the MERGE operation takes ages. For example, I executed the pipeline last night; almost 16 hours have passed and it has still not completed. Am I doing anything wrong here? Any suggestions on how to make the Delta MERGE faster? &lt;span class="lia-unicode-emoji" title=":disappointed_face:"&gt;😞&lt;/span&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Cheers&lt;/SPAN&gt;&lt;/P&gt;</description>
    <pubDate>Thu, 09 Apr 2026 23:25:43 GMT</pubDate>
    <dc:creator>FAHADURREHMAN</dc:creator>
    <dc:date>2026-04-09T23:25:43Z</dc:date>
    <item>
      <title>DELTA Merge taking too much Time</title>
      <link>https://community.databricks.com/t5/data-engineering/delta-merge-taking-too-much-time/m-p/154001#M54050</link>
      <description>&lt;P&gt;Hi Legends,&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have a time-series Delta table of&amp;nbsp;&lt;SPAN&gt;707.1 GiB, 7,702 files, and 262 billion rows. The table is clustered on 2 columns (a timestamp column and a descriptive column).&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;I have designed a pipeline that runs every week and MERGEs source CDC data into the above Delta table.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;However, the MERGE operation takes ages. For example, I executed the pipeline last night; almost 16 hours have passed and it has still not completed. Am I doing anything wrong here? Any suggestions on how to make the Delta MERGE faster? &lt;span class="lia-unicode-emoji" title=":disappointed_face:"&gt;😞&lt;/span&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Cheers&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 09 Apr 2026 23:25:43 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delta-merge-taking-too-much-time/m-p/154001#M54050</guid>
      <dc:creator>FAHADURREHMAN</dc:creator>
      <dc:date>2026-04-09T23:25:43Z</dc:date>
    </item>
    <item>
      <title>Re: DELTA Merge taking too much Time</title>
      <link>https://community.databricks.com/t5/data-engineering/delta-merge-taking-too-much-time/m-p/154015#M54052</link>
      <description>&lt;P&gt;&lt;SPAN&gt;Great question -- slow MERGE is one of the most common Delta Lake performance issues. Here's a systematic checklist:&lt;/SPAN&gt;&lt;/P&gt;
&lt;H3&gt;&lt;STRONG&gt;1. Partition Pruning in the MERGE Condition&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;&lt;SPAN&gt;The #1 cause of slow MERGEs is missing the partition column in your ON clause. If your target table is partitioned by, say, date, your merge condition &lt;/SPAN&gt;&lt;STRONG&gt;must&lt;/STRONG&gt;&lt;SPAN&gt; include it:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;MERGE INTO target t&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;USING source s&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;ON t.date = s.date AND t.id = s.id &amp;nbsp; -- includes partition column&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;WHEN MATCHED THEN UPDATE SET ...&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;WHEN NOT MATCHED THEN INSERT ...&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Without it, Spark scans &lt;/SPAN&gt;&lt;STRONG&gt;every partition&lt;/STRONG&gt;&lt;SPAN&gt; looking for matches -- even if the source data only touches a tiny slice.&lt;/SPAN&gt;&lt;/P&gt;
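A common workaround when the source batch only touches a few partitions is to compute the distinct partition values from the source first and fold them into the ON clause as a literal predicate, so Delta can prune files before the join. This is a minimal sketch, not code from the thread; `target`, `source_df`, and the `date`/`id` columns are placeholder names:

```python
# Hypothetical sketch: restrict the target scan to the dates actually
# present in this CDC batch by building a literal IN-list predicate.

def pruning_predicate(dates):
    """Build a merge condition that limits the target scan to the
    partitions touched by the source batch."""
    date_list = ", ".join(f"'{d}'" for d in sorted(dates))
    return f"t.date IN ({date_list}) AND t.date = s.date AND t.id = s.id"

# Usage (requires a Spark session and a DeltaTable handle):
# dates = [r.date for r in source_df.select("date").distinct().collect()]
# target.alias("t").merge(source_df.alias("s"), pruning_predicate(dates)) \
#     .whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()
```

Because the IN-list is a literal, it is visible to the planner at optimization time and file skipping kicks in even when the join key alone would not prune anything.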
&lt;H3&gt;&lt;STRONG&gt;2. Enable Low Shuffle Merge&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;&lt;SPAN&gt;Low Shuffle Merge (LSM) dramatically reduces the amount of data rewritten by only rewriting files that actually contain changes. It is enabled by default on recent DBR versions; on older runtimes, enable it explicitly:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;SET spark.databricks.delta.merge.enableLowShuffle = true;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;This avoids the classic problem where MERGE rewrites files even when no rows in them changed.&lt;/SPAN&gt;&lt;/P&gt;
&lt;H3&gt;&lt;STRONG&gt;3. Z-ORDER on Join Columns&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;&lt;SPAN&gt;If your table isn't partitioned (or the merge key isn't the partition column), Z-ORDER on the columns used in your ON clause:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;OPTIMIZE target_table ZORDER BY (id);&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;This co-locates related data in fewer files, so the merge only reads/rewrites a small number of files.&lt;/SPAN&gt;&lt;/P&gt;
&lt;H3&gt;&lt;STRONG&gt;4. Right-size Your Source Data&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;&lt;SPAN&gt;If the source (incoming) data is small relative to the target, &lt;/SPAN&gt;&lt;STRONG&gt;broadcast&lt;/STRONG&gt;&lt;SPAN&gt; it:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;from pyspark.sql.functions import broadcast&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;target.alias("t").merge(broadcast(source_df).alias("s"), "t.id = s.id") ...&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;# or raise the threshold so Spark broadcasts small sources automatically:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "100m")&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;If the source is large, consider breaking it into micro-batches by partition.&lt;/SPAN&gt;&lt;/P&gt;
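The micro-batching idea above can be sketched as a loop over date chunks. This is a hypothetical outline, not from the thread; `target`, `source_df`, and the `date`/`id` columns are placeholders, and the chunking helper is the only part shown in full:

```python
# Hypothetical sketch: merge a large CDC batch one chunk of dates at a
# time, so each MERGE only touches a small slice of the target table.

def batch_filters(dates, batch_size=7):
    """Group the source's distinct dates into fixed-size chunks and
    return one SQL filter string per chunk."""
    dates = sorted(dates)
    chunks = [dates[i:i + batch_size] for i in range(0, len(dates), batch_size)]
    return [
        "date IN ({})".format(", ".join(f"'{d}'" for d in chunk))
        for chunk in chunks
    ]

# Usage (requires a Spark session and a DeltaTable handle):
# for f in batch_filters(all_source_dates):
#     target.alias("t").merge(
#         source_df.filter(f).alias("s"),
#         f"t.date = s.date AND t.id = s.id AND t.{f}",
#     ).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()
```

Smaller merges also commit more often, so a transient failure at hour 15 does not throw away all the work.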
&lt;H3&gt;&lt;STRONG&gt;5. File Compaction&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;&lt;SPAN&gt;Too many small files in the target table forces the merge to open/scan thousands of files:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;OPTIMIZE target_table;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Run this periodically. Check file count with:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;DESCRIBE DETAIL target_table;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
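As a rough rule of thumb, if the average file size from DESCRIBE DETAIL is well under ~128 MB, compaction is overdue. A small sketch of that check (the `spark.sql` lines are commented out because they need a live session; the table name is a placeholder):

```python
# Hypothetical sketch: compute average file size from the sizeInBytes
# and numFiles columns returned by DESCRIBE DETAIL.

def avg_file_size_mb(size_in_bytes, num_files):
    """Average Delta file size in MiB; 0.0 for an empty table."""
    return (size_in_bytes / num_files) / (1024 * 1024) if num_files else 0.0

# d = spark.sql("DESCRIBE DETAIL target_table").collect()[0]
# print(avg_file_size_mb(d.sizeInBytes, d.numFiles))
```

For the table in the question (707.1 GiB across 7,702 files) this works out to roughly 94 MiB per file, which is not pathological; the file count itself is unlikely to be the bottleneck here.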
&lt;H3&gt;&lt;STRONG&gt;6. Liquid Clustering (Recommended for New Tables)&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;&lt;SPAN&gt;If you're on a recent DBR, consider migrating to &lt;/SPAN&gt;&lt;STRONG&gt;Liquid Clustering&lt;/STRONG&gt;&lt;SPAN&gt; instead of traditional partitioning + Z-ORDER. It automatically handles file layout:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;ALTER TABLE target_table CLUSTER BY (id, date);&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H3&gt;&lt;STRONG&gt;7. Check the Spark UI&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;&lt;SPAN&gt;Look at the &lt;/SPAN&gt;&lt;STRONG&gt;SQL tab&lt;/STRONG&gt;&lt;SPAN&gt; in the Spark UI for your MERGE job. Key things to check:&lt;/SPAN&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI style="font-weight: 400;" aria-level="1"&gt;&lt;SPAN&gt;Number of files scanned vs. number of files rewritten -- if these are close to the total file count, you're not pruning&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI style="font-weight: 400;" aria-level="1"&gt;&lt;SPAN&gt;Shuffle read/write sizes -- large shuffles indicate the join is scanning too broadly&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI style="font-weight: 400;" aria-level="1"&gt;&lt;SPAN&gt;Skew in task durations -- one slow task can bottleneck everything&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3&gt;&lt;STRONG&gt;Quick Diagnostic Query&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;&lt;SPAN&gt;DESCRIBE HISTORY target_table LIMIT 5;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Look at &lt;/SPAN&gt;&lt;STRONG&gt;operationMetrics&lt;/STRONG&gt;&lt;SPAN&gt; -- it shows numTargetFilesAdded, numTargetFilesRemoved, numTargetRowsUpdated, etc. If files removed is approximately equal to total files, you're rewriting the whole table on every merge.&lt;/SPAN&gt;&lt;/P&gt;
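That check can be automated by scanning recent MERGE commits for near-full rewrites. A minimal sketch, assuming the table name is a placeholder and noting that DESCRIBE HISTORY returns operationMetrics values as strings:

```python
# Hypothetical sketch: flag MERGE commits whose numTargetFilesRemoved is
# close to the table's total file count (i.e. near-full table rewrites).

def rewrite_ratio(metrics, total_files):
    """Fraction of the table's files removed (rewritten) by one commit."""
    removed = int(metrics.get("numTargetFilesRemoved", "0"))
    return removed / total_files if total_files else 0.0

# Usage (requires a Spark session):
# hist = spark.sql("DESCRIBE HISTORY target_table LIMIT 5").collect()
# for row in hist:
#     if row.operation == "MERGE" and rewrite_ratio(row.operationMetrics, 7702) > 0.9:
#         print(f"version {row.version}: near-full table rewrite")
```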
&lt;H2&gt;&lt;STRONG&gt;References&lt;/STRONG&gt;&lt;/H2&gt;
&lt;UL&gt;
&lt;LI style="font-weight: 400;" aria-level="1"&gt;&lt;A href="https://community.databricks.com/t5/community-articles/why-your-delta-lake-merge-takes-forever-and-how-to-fix-it/td-p/140636" target="_blank"&gt;&lt;SPAN&gt;Why Your Delta Lake MERGE Takes Forever -- Databricks Community&lt;/SPAN&gt;&lt;/A&gt;&lt;/LI&gt;
&lt;LI style="font-weight: 400;" aria-level="1"&gt;&lt;A href="https://community.databricks.com/t5/data-engineering/quot-merge-quot-always-slower-than-quot-create-or-replace-quot/td-p/31513" target="_blank"&gt;&lt;SPAN&gt;MERGE Always Slower Than CREATE OR REPLACE -- Databricks Community&lt;/SPAN&gt;&lt;/A&gt;&lt;/LI&gt;
&lt;LI style="font-weight: 400;" aria-level="1"&gt;&lt;A href="https://community.databricks.com/t5/data-engineering/merge-rewrites-many-unmodified-files/td-p/111689" target="_blank"&gt;&lt;SPAN&gt;Merge Rewrites Many Unmodified Files -- Databricks Community&lt;/SPAN&gt;&lt;/A&gt;&lt;/LI&gt;
&lt;LI style="font-weight: 400;" aria-level="1"&gt;&lt;A href="https://community.databricks.com/t5/data-engineering/merge-operation-stuck-on-scanning-files-for-matches/td-p/94463" target="_blank"&gt;&lt;SPAN&gt;Merge Operation Stuck on Scanning Files -- Databricks Community&lt;/SPAN&gt;&lt;/A&gt;&lt;/LI&gt;
&lt;LI style="font-weight: 400;" aria-level="1"&gt;&lt;A href="https://medium.com/@joannawu_70798/your-databricks-merge-is-probably-slow-heres-the-fix-700be0abfc75" target="_blank"&gt;&lt;SPAN&gt;Your Databricks MERGE Is Probably Slow -- Here's the Fix (Medium)&lt;/SPAN&gt;&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Fri, 10 Apr 2026 04:20:08 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delta-merge-taking-too-much-time/m-p/154015#M54052</guid>
      <dc:creator>anuj_lathi</dc:creator>
      <dc:date>2026-04-10T04:20:08Z</dc:date>
    </item>
    <item>
      <title>Re: DELTA Merge taking too much Time</title>
      <link>https://community.databricks.com/t5/data-engineering/delta-merge-taking-too-much-time/m-p/154027#M54056</link>
      <description>&lt;P&gt;Thanks &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/182781"&gt;@anuj_lathi&lt;/a&gt;&amp;nbsp;, should I enable partitioning and Z-Ordering and move away from clustering for the target table?&lt;/P&gt;</description>
      <pubDate>Fri, 10 Apr 2026 04:55:22 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delta-merge-taking-too-much-time/m-p/154027#M54056</guid>
      <dc:creator>FAHADURREHMAN</dc:creator>
      <dc:date>2026-04-10T04:55:22Z</dc:date>
    </item>
  </channel>
</rss>

