Warehousing & Analytics
Does "Merge Into" skip files when reading target table to find files to be touched?

gmiguel
New Contributor III

I've been testing partitioning vs. Z-Ordering to optimize the merge process.
As the documentation says, tables smaller than 1 TB should not be partitioned and can instead benefit from Z-Ordering to speed up reads.
Analyzing the merge process, I found that even after Z-Ordering, the target table is always read in full to perform the join with the changed data. This means that if 1 record changes and the target table is 100 GB, the merge reads the whole 100 GB to identify the files that need to be rewritten, without using column statistics for file skipping.

This behavior seems odd to me, but that's what I found when analyzing the merge execution plan.

Good old-fashioned partitioning still seems more suitable for merge processes. A sketch of the test I ran is below.
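For reference, this is roughly the test (table and column names are made up; `updates_df` is a DataFrame holding the changed rows). The number of files actually read by the target-side scan shows up on the scan node in the Spark UI / execution plan, while `DESCRIBE HISTORY` reports the files the MERGE added and removed:

```python
from delta.tables import DeltaTable

# Cluster the target's data files by the join key so min/max statistics
# could, in principle, be used to skip files.
spark.sql("OPTIMIZE events ZORDER BY (id)")

# Merge a small batch of changed rows into the large target table.
(DeltaTable.forName(spark, "events").alias("t")
    .merge(updates_df.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# operationMetrics shows files added/removed by the MERGE; how many files
# were *scanned* is visible in the Spark UI for the target-side scan.
spark.sql("DESCRIBE HISTORY events LIMIT 1") \
    .select("operationMetrics").show(truncate=False)
```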

 


3 REPLIES

Kaniz
Community Manager

Hi @gmiguel

I've been testing partitioning vs. Z-Ordering to optimize the merge process.

As the documentation says, tables smaller than 1 TB should not be partitioned and can instead benefit from Z-Ordering to speed up reads.
Analyzing the merge process, I found that even after Z-Ordering, the target table is always read in full to perform the join with the changed data. If 1 record changes and the target table is 100 GB, the merge reads the whole 100 GB to identify the files that need to be rewritten, without using column statistics for file skipping.

This behaviour seems odd to me, but that's what I found by analyzing the merge execution plan.

Good old-fashioned partitioning still seems more suitable for merge processes.

gmiguel
New Contributor III

@Kaniz 

I think you forgot to write down your thoughts...

Going further...

Is there any improvement on the roadmap to speed up the merge? It doesn't make sense to have column statistics and still rely only on partition pruning to narrow down the data scanned. That's a lot of wasted compute, given that file skipping based on the Delta log statistics could be used here. For now, partition pruning only kicks in with an explicit predicate, as in the sketch below.
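To illustrate what partition pruning currently requires (a rough sketch with hypothetical names: `events` partitioned by `event_date`, `updates_df` holding the changes), the partition values have to be pinned as literals in the merge condition so the target-side scan only touches the matching partitions:

```python
from delta.tables import DeltaTable

# Collect the partition values present in the source batch...
dates = [str(r["event_date"])
         for r in updates_df.select("event_date").distinct().collect()]

# ...and pin them as literals in the ON clause so static partition
# pruning can narrow the target-side scan.
condition = "t.id = s.id AND t.event_date IN ({})".format(
    ", ".join("'{}'".format(d) for d in dates))

(DeltaTable.forName(spark, "events").alias("t")
    .merge(updates_df.alias("s"), condition)
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```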

gmiguel
New Contributor III
ACCEPTED SOLUTION

I've found the answer I was looking for.

https://docs.databricks.com/en/optimizations/dynamic-file-pruning.html

Dynamic file pruning applies to MERGE INTO, UPDATE, and DELETE statements only when Photon is enabled.
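For anyone who wants to verify the relevant settings on their cluster, these are the configs listed on that docs page (a quick check; defaults may differ across Databricks Runtime versions):

```python
# Config names as listed on the dynamic-file-pruning docs page.
for key in (
    "spark.databricks.optimizer.dynamicFilePruning",       # master switch, on by default
    "spark.databricks.optimizer.deltaTableSizeThreshold",  # min table size for DFP
    "spark.databricks.optimizer.deltaTableFilesThreshold", # min number of files for DFP
):
    print(key, "=", spark.conf.get(key, "not set"))
```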

Thank you
