Databricks Community

KacperG · ‎10-17-2024

Hi

I'm running a simple MERGE, but it always gets stuck at "MERGE operation - scanning files for matches". Neither Delta table is big: the source is about 100 MiB in 1 file and the target is 1.5 GiB in 7 files, so it should be a fairly fast operation, yet it hangs indefinitely at this step.

Both tables are external and located on ADLS Gen2.
By contrast, replacing the table with a SELECT * runs without problems.

KacperG · ‎01-14-2025

Well, in the end, it was caused by skewed data. Document_ID was -1 for returns in sales, so a large part of the table had the value -1. Adding an extra column to the merge condition solved the problem.

This article helped me a lot:

https://www.databricks.com/discover/pages/optimize-data-workloads-guide#data-skewness
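
To illustrate why one dominant key value hurts so much, here is a minimal plain-Python sketch (no Spark needed). It counts how many row pairs an equi-join has to consider when matching on Document_ID alone versus Document_ID plus a second column. The toy data and the column name `line_no` are assumptions made for the example, not the actual schema from this thread.

```python
from collections import Counter

def candidate_pairs(source_keys, target_keys):
    """Count the row pairs an equi-join on these keys would produce."""
    src = Counter(source_keys)
    tgt = Counter(target_keys)
    # Each key contributes (rows in source) * (rows in target) pairs.
    return sum(n * tgt[k] for k, n in src.items())

# Toy rows as (Document_ID, line_no); line_no is a hypothetical extra column.
# Returns all share Document_ID = -1, regular sales have distinct IDs.
source = [(-1, i) for i in range(1000)] + [(i, 1) for i in range(1, 101)]
target = [(-1, i) for i in range(5000)] + [(i, 1) for i in range(1, 1001)]

# Matching on Document_ID only: every -1 row pairs with every other -1 row.
by_id = candidate_pairs([r[0] for r in source], [r[0] for r in target])

# Matching on (Document_ID, line_no): the -1 rows are told apart again.
by_id_and_line = candidate_pairs(source, target)

print(by_id)           # 5000100
print(by_id_and_line)  # 1100
```

On real tables, a comparable check would be a GROUP BY on the merge key with a COUNT(*), sorted descending, to see whether a single value dominates.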

LauJohansson · ‎10-18-2024

Have you enabled liquid clustering on your tables?

-werners- · ‎10-18-2024

Merging requires a lot more compute than an overwrite.
It has to check which files to replace.
If your data is very skewed, e.g. 60 percent or more of the data resides in a single file, that will be the bottleneck.
To make merges fast, you can use optimizations (like liquid clustering) or partition the table and do partition pruning while merging (basically an added WHERE-style condition on the partition column).
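
A rough sketch of the partition-pruning idea, written as a small Python helper that only builds the MERGE text. The table names (`sales`, `sales_updates`), the partition column `sale_date`, and the helper itself are hypothetical; adapt them to your schema.

```python
def build_pruned_merge(target, source, key, partition_col, partition_values):
    """Return a MERGE statement whose ON clause also restricts the
    target's partition column to a known list of values."""
    # Quote each value as a SQL string literal, escaping single quotes.
    values = ", ".join("'{}'".format(v.replace("'", "''")) for v in partition_values)
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON t.{key} = s.{key} AND t.{partition_col} IN ({values})\n"
        "WHEN MATCHED THEN UPDATE SET *\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )

# Hypothetical names, for illustration only.
sql = build_pruned_merge(
    "sales", "sales_updates", "Document_ID", "sale_date",
    ["2024-10-16", "2024-10-17"],
)
print(sql)
# On a cluster you would then run: spark.sql(sql)
```

Note that the extra predicate is only safe if the source really contains rows for those partitions alone. A source row belonging to another partition would not match and would be inserted as new.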

Sourav-Kundu · ‎10-19-2024

Merge can take time due to various reasons.

You can try one or a combination of the following options:

1. Use the OPTIMIZE command to compact small files into larger ones. This reduces the number of files that need to be read during the MERGE, improving performance.

2. You can use Liquid Clustering.

3. If one of the tables in the MERGE operation is small, consider using broadcast joins.

4. You can use Change Data Capture (CDC) to efficiently track and manage changes in data for MERGE operations. When using CDC, you can optimize your MERGE statements by only processing the changed records. This minimizes the volume of data scanned and processed, improving overall performance.
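
As a minimal sketch of options 1 and 2, the helper below only assembles the statements as text. The helper, the table name `sales`, and the clustering column are assumptions for the example. `OPTIMIZE` and `ALTER TABLE ... CLUSTER BY` are Databricks SQL commands, but whether liquid clustering is available depends on your runtime version, and it cannot be combined with a partitioned table. Options 3 and 4 are not covered here.

```python
def maintenance_statements(table, cluster_cols=None):
    """Return the SQL statements to run, in order, before retrying the MERGE."""
    stmts = []
    if cluster_cols:
        # Option 2: declare liquid clustering keys on the table.
        stmts.append(f"ALTER TABLE {table} CLUSTER BY ({', '.join(cluster_cols)})")
    # Option 1: compact small files (this also applies the clustering, if set).
    stmts.append(f"OPTIMIZE {table}")
    return stmts

# Hypothetical table and column names.
for stmt in maintenance_statements("sales", ["Document_ID"]):
    print(stmt)
    # On a cluster you would then run: spark.sql(stmt)
```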

Pablo_Camacho · ‎11-09-2024

Hello @Sourav-Kundu,

Could you kindly provide more detail on your third point:

- *"If one of the tables in the MERGE operation is small, consider using broadcast joins."*

I’m interested in understanding how to apply a BROADCAST hint (/*+ BROADCAST(table) */) within a MERGE statement. I've tried a few methods without success.

I'm currently encountering a similar issue as @KacperG. My straightforward merge statement seems to be stalled. The target table utilizes liquid clustering, and the source table is loaded into memory using `createOrReplaceTempView` before executing the merge. This situation is quite challenging, and I'm running out of options. Your input would be highly appreciated. Thank you!

Merge operation stuck on scanning files for matches
