Hi Databricks Community,
I'm analyzing the performance of Delta Lake MERGE operations on a partitioned table, and I observed unexpected behavior across three test cases.
I wanted to share my findings to better understand:
- Why ZORDER or Deletion Vectors helped or hurt performance
- What factors explain the execution time behavior
- What optimizations might improve performance in a partitioned table setup
Table Setup:
- Table is partitioned
- Merge condition targets only 17 specific rows
- All merges run against the same table; only the optimization strategy changes (a sketch of the merge shape is shown below)
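For clarity, the merge in all three cases has the general shape below. Every table, view, and column name in the sketch is a placeholder, not the real schema.

```python
# Minimal sketch of the merge shape (all names are hypothetical placeholders).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already defined in a Databricks notebook

spark.sql("""
    MERGE INTO main.sales.events AS t          -- partitioned Delta target table
    USING updates_17_rows AS s                 -- source containing the 17 changed rows
    ON  t.event_date = s.event_date            -- partition column (assumed)
    AND t.event_id   = s.event_id              -- business key (assumed)
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```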
Case 1: Merge After Optimizing (No ZORDER, No Deletion Vectors)
| Metric | Value |
| --- | --- |
| numTargetRowsCopied | 5.9 million |
| executionTimeMs | 194,757 ms |
| rewriteTimeMs | 51,049 ms |
| scanTimeMs | 7,974 ms |
| numTargetFilesAdded | 1 |
Observation: Reasonable performance, but the merge still rewrote a large amount of data (5.9 million rows copied) even though only 17 rows matched.
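For anyone who wants to compare, the numbers above come from the MERGE entry in the Delta transaction history (operationMetrics). A rough sketch of how I read them, with a placeholder table name:

```python
# Read MERGE operation metrics from the Delta history (table name is a placeholder).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

history = spark.sql("DESCRIBE HISTORY main.sales.events")
merge_metrics = (
    history
    .filter("operation = 'MERGE'")
    .selectExpr(
        "version",
        "operationMetrics['numTargetRowsCopied'] AS rows_copied",
        "operationMetrics['executionTimeMs']     AS execution_ms",
        "operationMetrics['rewriteTimeMs']       AS rewrite_ms",
        "operationMetrics['scanTimeMs']          AS scan_ms",
        "operationMetrics['numTargetFilesAdded'] AS files_added",
    )
)
merge_metrics.show(truncate=False)
```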
Case 2: Merge After ZORDER on 4 Columns (No Deletion Vectors)
| Metric | Value |
| --- | --- |
| numTargetRowsCopied | 30.3 million |
| executionTimeMs | 202,994 ms |
| rewriteTimeMs | 113,906 ms |
| scanTimeMs | 11,715 ms |
| numTargetFilesAdded | 7 |
Observation: Worse performance than Case 1; even though ZORDER was applied, far more rows were copied (30.3 million vs. 5.9 million) and seven files were rewritten instead of one.
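For context, the ZORDER step before this merge was of this general form; the table and the four column names below are placeholders for the real ones.

```python
# OPTIMIZE with ZORDER on four columns, run before the Case 2 merge
# (table and column names are hypothetical placeholders).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    OPTIMIZE main.sales.events
    ZORDER BY (event_id, customer_id, region, status)
""")
```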
Case 3: Merge After ZORDER + Deletion Vectors Enabled
| Metric | Value |
| --- | --- |
| numTargetRowsCopied | 15 million |
| executionTimeMs | 264,358 ms |
| rewriteTimeMs | 122,688 ms |
| scanTimeMs | 19,342 ms |
| numTargetFilesAdded | 1 |
Observation: Even with Deletion Vectors enabled, which should avoid rewriting entire files for a small update, the execution time actually increased.
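For completeness, Deletion Vectors were enabled at the table level before the Case 3 merge, along these lines (table name is a placeholder):

```python
# Enable Deletion Vectors on the target Delta table (table name is a placeholder).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    ALTER TABLE main.sales.events
    SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'true')
""")
```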
My questions:
- Why did enabling Deletion Vectors result in higher execution time, even though fewer files were touched and fewer rows were copied than in Case 2?
- Why does ZORDERing seem to increase the number of rows copied in Case 2, even though it improves scan locality?
- What are the best practices to ensure that ZORDER and Deletion Vectors actually improve performance, especially when updating a very small number of rows? (One variant I am considering is sketched below.)
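For reference, one variant I am considering (but have not yet validated) is adding an explicit partition-pruning predicate to the MERGE condition so that only the partitions that can contain the 17 rows are scanned and rewritten. All names and the date literals below are hypothetical.

```python
# Sketch of a merge with a static partition filter in the ON clause, intended to
# limit scanning and rewriting to the partitions that can hold the 17 target rows.
# Table, columns, and date literals are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    MERGE INTO main.sales.events AS t
    USING updates_17_rows AS s
    ON  t.event_date = s.event_date
    AND t.event_id   = s.event_id
    AND t.event_date IN ('2024-01-01', '2024-01-02')   -- static partition filter
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```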
Would love to hear your insights and suggestions!
Thanks in advance!