- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
04-09-2026 04:25 PM
Hi Legends,
I have a timeseries DELTA table having 707.1GiB, 7702 files, 262 Billion rows. (Mainly its timeseries data). This table is clustered on 2 columns (Timestamp col & 2nd one is descriptive column)
I have designed a pipeline which runs every week and MERGEs source cdc data into above DELTA table.
However it takes ages to perform MERGE operation. For example i executed pipeline last night and its almost 16 Hours have passed and yet its not completed. M i doing anything wrong here?? And any suggestion how to make DELTA MERGE faster? 😞
Cheers
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
04-09-2026 09:20 PM
Great question -- slow MERGE is one of the most common Delta Lake performance issues. Here's a systematic checklist:
1. Partition Pruning in the MERGE Condition
The #1 cause of slow MERGEs is missing the partition column in your ON clause. If your target table is partitioned by, say, date, your merge condition must include it:
MERGE INTO target t
USING source s
ON t.date = s.date AND t.id = s.id -- includes partition column
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...
Without it, Spark scans every partition looking for matches -- even if the source data only touches a tiny slice.
2. Enable Low Shuffle Merge
Low Shuffle Merge (LSM) dramatically reduces the amount of data rewritten by only touching files that actually have changes:
SET spark.databricks.delta.merge.lowShuffle.enabled = true;
This avoids the classic problem where MERGE rewrites files even when no rows in them changed.
3. Z-ORDER on Join Columns
If your table isn't partitioned (or the merge key isn't the partition column), Z-ORDER on the columns used in your ON clause:
OPTIMIZE target_table ZORDER BY (id);
This co-locates related data in fewer files, so the merge only reads/rewrites a small number of files.
4. Right-size Your Source Data
If the source (incoming) data is small relative to the target, broadcast it:
from pyspark.sql.functions import broadcast
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "100m")
If the source is large, consider breaking it into micro-batches by partition.
5. File Compaction
Too many small files in the target table forces the merge to open/scan thousands of files:
OPTIMIZE target_table;
Run this periodically. Check file count with:
DESCRIBE DETAIL target_table;
6. Liquid Clustering (Recommended for New Tables)
If you're on a recent DBR, consider migrating to Liquid Clustering instead of traditional partitioning + Z-ORDER. It automatically handles file layout:
ALTER TABLE target_table CLUSTER BY (id, date);
7. Check the Spark UI
Look at the SQL tab in the Spark UI for your MERGE job. Key things to check:
- Number of files scanned vs. number of files rewritten -- if these are close to the total file count, you're not pruning
- Shuffle read/write sizes -- large shuffles indicate the join is scanning too broadly
- Skew in task durations -- one slow task can bottleneck everything
Quick Diagnostic Query
DESCRIBE HISTORY target_table LIMIT 5;
Look at operationMetrics -- it shows numTargetFilesAdded, numTargetFilesRemoved, numTargetRowsUpdated, etc. If files removed is approximately equal to total files, you're rewriting the whole table on every merge.
References
- Why Your Delta Lake MERGE Takes Forever -- Databricks Community
- MERGE Always Slower Than CREATE OR REPLACE -- Databricks Community
- Merge Rewrites Many Unmodified Files -- Databricks Community
- Merge Operation Stuck on Scanning Files -- Databricks Community
- Your Databricks MERGE Is Probably Slow -- Here's the Fix (Medium)
Solutions Engineer @ Databricks
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
04-09-2026 09:55 PM
Thsnk @anuj_lathi , Should I enable Partitioning and Z-Ordering and move away from Clustering for the target table?