Community Platform Discussions

Merge operation stuck on scanning files for matches

KacperG
New Contributor

Hi,

I'm running a simple merge, but it always gets stuck at "MERGE operation - scanning files for matches". Neither Delta table is big: the source is about 100 MiB in 1 file and the target is 1.5 GiB in 7 files, so the operation should be quite fast. Instead, it hangs indefinitely at this step:

[Screenshot: KacperG_0-1729157109666.png]

Both tables are external and located on ADLS Gen2.
By contrast, replacing the table with a select * runs without problems.

3 REPLIES

LauJohansson
Contributor

Have you enabled liquid clustering on your tables? 

-werners-
Esteemed Contributor III

Merging requires a lot more compute than an overwrite.
It has to check which files to replace.
If your data is very skewed, e.g. 60 percent or more of the data resides in a single file, that file will be the bottleneck.
To make merges fast, you can use optimizations (like liquid clustering), or partition the table and do partition pruning while merging (basically an added where clause on the partition column).
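
To make the partition pruning idea concrete, here is a minimal sketch that only builds the MERGE statement text. The table names, the key column, and the partition column are hypothetical, and it assumes the target is partitioned by a column with string-like values. The extra condition on the partition column goes into the ON clause, so the merge only has to scan files in the listed partitions.

```python
from typing import Sequence


def build_pruned_merge(
    target: str,
    source: str,
    key_cols: Sequence[str],
    partition_col: str,
    partition_values: Sequence[str],
) -> str:
    """Return a MERGE statement whose ON clause also restricts the target partitions."""
    if not key_cols:
        raise ValueError("at least one key column is required")
    if not partition_values:
        raise ValueError("at least one partition value is required")

    # Join condition on the key columns shared by target (t) and source (s).
    key_match = " AND ".join(f"t.{c} = s.{c}" for c in key_cols)

    # Literal filter on the target's partition column; single quotes are doubled to escape them.
    quoted = ", ".join("'" + v.replace("'", "''") + "'" for v in partition_values)
    prune = f"t.{partition_col} IN ({quoted})"

    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON {key_match} AND {prune}\n"
        "WHEN MATCHED THEN UPDATE SET *\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    sql = build_pruned_merge(
        "sales_target", "sales_updates", ["order_id"], "order_date",
        ["2024-10-16", "2024-10-17"],
    )
    print(sql)
    # On a Databricks cluster the statement could then be run with spark.sql(sql).
```

One caveat with this pattern: source rows that belong to a partition outside the listed values will not match and will be inserted as new rows, so the source should only contain rows for the partitions named in the filter.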

Sourav-Kundu
New Contributor

A merge can be slow for various reasons.

You can try one or a combination of the following options:

1. Use the OPTIMIZE command to compact small files into larger ones. This reduces the number of files that need to be read during the MERGE, improving performance.

2. Use liquid clustering.

3. If one of the tables in the MERGE operation is small, consider using a broadcast join.

4. You can use Change Data Capture (CDC) to efficiently track and manage changes in data for MERGE operations. When using CDC, you can optimize your MERGE statements by only processing the changed records. This minimizes the volume of data scanned and processed, improving overall performance.
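
As a rough sketch of options 1, 2 and 4, the helper below only assembles the SQL statements as strings. The table name and clustering column are hypothetical. It also assumes that "CDC" here refers to Delta's change data feed, and that the runtime version supports enabling liquid clustering on an existing, unpartitioned table; neither is confirmed in this thread.

```python
from typing import List, Sequence


def maintenance_statements(table: str, cluster_cols: Sequence[str]) -> List[str]:
    """Return SQL statements that prepare a Delta table for faster merges."""
    if not cluster_cols:
        raise ValueError("at least one clustering column is required")

    cols = ", ".join(cluster_cols)
    return [
        # Option 2: enable liquid clustering on the merge key column(s).
        f"ALTER TABLE {table} CLUSTER BY ({cols})",
        # Option 1: compact files; on a clustered table this also applies the clustering.
        f"OPTIMIZE {table}",
        # Option 4 (assumption): turn on the change data feed so later merges
        # can be fed with changed rows only.
        f"ALTER TABLE {table} SET TBLPROPERTIES (delta.enableChangeDataFeed = true)",
    ]


if __name__ == "__main__":
    # Hypothetical table and column names, for illustration only.
    for stmt in maintenance_statements("sales_target", ["order_id"]):
        print(stmt)
        # On a Databricks cluster each statement could be run with spark.sql(stmt).
```

For option 3, one approach when using the Python `DeltaTable` API is to wrap the small source DataFrame in `pyspark.sql.functions.broadcast` before passing it to the merge; whether that helps depends on the actual size of the source.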
