Warehousing & Analytics
Does "Merge Into" skip files when reading target table to find files to be touched?

gmiguel
New Contributor III

I've been testing partitioning vs. Z-Ordering to optimize the merge process.
As the documentation says, tables smaller than 1 TB should not be partitioned and can instead benefit from Z-Ordering to speed up reads.
Analyzing the merge process, I found that even after Z-Ordering, the target table is always read in full to perform the join with the changed data. This means that if 1 record changes and the target table is 100 GB, the merge reads the whole 100 GB to identify the files that need to be rewritten, without using column statistics for file skipping.

This behavior seems odd to me, but that's what I found when analyzing the merge execution plan.

Good old-fashioned partitioning still seems more suitable for merge processes. A sketch of the test I ran is below.
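For reference, this is roughly the test (table and column names are made up; `updates_df` is a DataFrame holding the changed rows). The number of files actually read by the target-side scan shows up on the scan node in the Spark UI / execution plan, while `DESCRIBE HISTORY` reports the files the MERGE added and removed:

```python
from delta.tables import DeltaTable

# Cluster the target's data files by the join key so min/max statistics
# could, in principle, be used to skip files.
spark.sql("OPTIMIZE events ZORDER BY (id)")

# Merge a small batch of changed rows into the large target table.
(DeltaTable.forName(spark, "events").alias("t")
    .merge(updates_df.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# operationMetrics shows files added/removed by the MERGE; how many files
# were *scanned* is visible in the Spark UI for the target-side scan.
spark.sql("DESCRIBE HISTORY events LIMIT 1") \
    .select("operationMetrics").show(truncate=False)
```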

 


3 REPLIES

Kaniz
Community Manager

Hi @gmiguel

I've been testing partitioning vs. Z-Ordering to optimize the merge process.

As the documentation says, tables smaller than 1 TB should not be partitioned and can instead benefit from Z-Ordering to speed up reads.
Analyzing the merge process, I found that even after Z-Ordering, the target table is always read in full to perform the join with the changed data. If 1 record changes and the target table is 100 GB, the merge reads the whole 100 GB to identify the files that need to be rewritten, without using column statistics for file skipping.

This behaviour seems odd to me, but that's what I found by analyzing the merge execution plan.

Good old-fashioned partitioning still seems more suitable for merge processes.

gmiguel
New Contributor III

@Kaniz 

I think you forgot to write down your thoughts...

Going further...

Is there any improvement on the roadmap to speed up the merge? It doesn't make sense to have column statistics and still rely only on partition pruning to narrow down the data scanned. That's a lot of wasted compute, given that file skipping based on the Delta log statistics could be used here. For now, partition pruning only kicks in with an explicit predicate, as in the sketch below.
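To illustrate what partition pruning currently requires (a rough sketch with hypothetical names: `events` partitioned by `event_date`, `updates_df` holding the changes), the partition values have to be pinned as literals in the merge condition so the target-side scan only touches the matching partitions:

```python
from delta.tables import DeltaTable

# Collect the partition values present in the source batch...
dates = [str(r["event_date"])
         for r in updates_df.select("event_date").distinct().collect()]

# ...and pin them as literals in the ON clause so static partition
# pruning can narrow the target-side scan.
condition = "t.id = s.id AND t.event_date IN ({})".format(
    ", ".join("'{}'".format(d) for d in dates))

(DeltaTable.forName(spark, "events").alias("t")
    .merge(updates_df.alias("s"), condition)
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```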

gmiguel
New Contributor III
ACCEPTED SOLUTION

I've found the answer I was looking for.

https://docs.databricks.com/en/optimizations/dynamic-file-pruning.html

Dynamic file pruning applies to MERGE INTO, UPDATE, and DELETE statements only when Photon is enabled.
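For anyone who wants to verify the relevant settings on their cluster, these are the configs listed on that docs page (a quick check; defaults may differ across Databricks Runtime versions):

```python
# Config names as listed on the dynamic-file-pruning docs page.
for key in (
    "spark.databricks.optimizer.dynamicFilePruning",       # master switch, on by default
    "spark.databricks.optimizer.deltaTableSizeThreshold",  # min table size for DFP
    "spark.databricks.optimizer.deltaTableFilesThreshold", # min number of files for DFP
):
    print(key, "=", spark.conf.get(key, "not set"))
```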

Thank you
