Great question -- slow MERGE is one of the most common Delta Lake performance issues. Here's a systematic checklist:
1. Partition Pruning in the MERGE Condition
The #1 cause of slow MERGEs is missing the partition column in your ON clause. If your target table is partitioned by, say, date, your merge condition must include it:
MERGE INTO target t
USING source s
ON t.date = s.date AND t.id = s.id -- includes partition column
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...
Without it, Spark scans every partition looking for matches -- even if the source data only touches a tiny slice.
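When the source only touches a few partitions, you can make the pruning explicit by pinning the partition values into the condition. A sketch, assuming a `date`-partitioned target; the literal dates are placeholders you would collect from the source beforehand:

```sql
-- Sketch: pin partition values (gathered from the source in a prior step)
-- into the ON clause so only those partitions are scanned.
MERGE INTO target t
USING source s
ON t.date = s.date
   AND t.date IN ('2024-01-01', '2024-01-02')  -- placeholder partition values
   AND t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```

The static IN-list gives the optimizer a literal predicate on the partition column, so pruning no longer depends on runtime filter pushdown.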
2. Enable Low Shuffle Merge
Low Shuffle Merge (LSM) dramatically reduces the amount of data rewritten by only touching files that actually have changes:
SET spark.databricks.delta.merge.enableLowShuffle = true;
This avoids the classic problem where MERGE rewrites files even when none of their rows changed. On recent Databricks Runtime versions (10.4+), Low Shuffle Merge is enabled by default, so this setting mainly matters on older runtimes.
3. Z-ORDER on Join Columns
If your table isn't partitioned (or the merge key isn't the partition column), Z-ORDER on the columns used in your ON clause:
OPTIMIZE target_table ZORDER BY (id);
This co-locates related data in fewer files, so the merge only reads/rewrites a small number of files.
4. Right-size Your Source Data
If the source (incoming) data is small relative to the target, broadcast it -- either with an explicit hint on the source DataFrame (pyspark.sql.functions.broadcast) or by raising the auto-broadcast threshold:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "100m")
If the source is large, consider breaking it into micro-batches by partition.
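The micro-batching idea can be sketched in plain Python. `micro_batches` is a hypothetical helper name (not a Spark API), and the MERGE-per-batch loop is only indicated in comments:

```python
# Hypothetical sketch: split incoming partition values into micro-batches
# so each MERGE touches only a handful of partitions at a time.

def micro_batches(partition_values, batch_size):
    """Group sorted partition values into fixed-size batches."""
    values = sorted(partition_values)
    return [values[i:i + batch_size] for i in range(0, len(values), batch_size)]

# Each batch would then drive one MERGE (PySpark, not run here):
#   for batch in micro_batches(source_dates, 3):
#       run MERGE with an explicit t.date IN (<batch>) predicate
```

Smaller per-merge working sets keep shuffle sizes bounded and make failures cheaper to retry.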
5. File Compaction
Too many small files in the target table forces the merge to open/scan thousands of files:
OPTIMIZE target_table;
Run this periodically. Check file count with:
DESCRIBE DETAIL target_table;
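As a rough rule of thumb, you could flag a table for compaction when its average file size (from the numFiles and sizeInBytes columns of DESCRIBE DETAIL) falls well below the target. `needs_compaction`, the 128 MB target, and the half-target threshold are all assumptions for illustration, not a Databricks API:

```python
# Hypothetical heuristic: compaction is worthwhile when the average file
# size is well below the target (128 MB is a common default target).

def needs_compaction(num_files, size_in_bytes, target_file_mb=128):
    """Flag tables whose average file size is under half the target."""
    if num_files == 0:
        return False
    avg_mb = size_in_bytes / num_files / (1024 * 1024)
    return avg_mb < target_file_mb / 2  # threshold is an assumption
```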
6. Liquid Clustering (Recommended for New Tables)
If you're on a recent Databricks Runtime, consider migrating to Liquid Clustering instead of traditional partitioning + Z-ORDER; it manages file layout automatically:
ALTER TABLE target_table CLUSTER BY (id, date);
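Altering an existing table only affects newly written data; a plain OPTIMIZE then clusters what's already there, incrementally (no ZORDER clause needed, since the clustering keys now come from the table definition):

```sql
-- Cluster existing data after ALTER TABLE ... CLUSTER BY
OPTIMIZE target_table;
```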
7. Check the Spark UI
Look at the SQL tab in the Spark UI for your MERGE job. Key things to check:
- Number of files scanned vs. number of files rewritten -- if these are close to the total file count, you're not pruning
- Shuffle read/write sizes -- large shuffles indicate the join is scanning too broadly
- Skew in task durations -- one slow task can bottleneck everything
Quick Diagnostic Query
DESCRIBE HISTORY target_table LIMIT 5;
Look at operationMetrics -- it shows numTargetFilesAdded, numTargetFilesRemoved, numTargetRowsUpdated, etc. If numTargetFilesRemoved is close to the table's total file count, you're rewriting the whole table on every merge.
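That check can be turned into a small helper. `rewrote_whole_table` and the 90% threshold are assumptions for illustration; note that operationMetrics values come back as strings in Delta history:

```python
# Hypothetical sketch: interpret operationMetrics from DESCRIBE HISTORY.
# Metric values are strings, so parse before comparing.

def rewrote_whole_table(operation_metrics, total_files, threshold=0.9):
    """True if the merge removed (i.e. rewrote) nearly all target files."""
    removed = int(operation_metrics.get("numTargetFilesRemoved", "0"))
    return total_files > 0 and removed / total_files >= threshold
```

Running this against the last few history entries quickly shows whether your merges are surgical or full rewrites.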
Anuj Lathi
Solutions Engineer @ Databricks