Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Why is Merge with Deletion Vectors Slower Than Full File Rewrite on the Same Table?

pooja_bhumandla
New Contributor II
I've run two MERGE INTO operations on the same Delta table: one with Deletion Vectors enabled (Case 1), and one without (Case 2).
In Case 1 (with Deletion Vectors):  
executionTimeMs: 106,708  
materializeSourceTimeMs: 24,344 
numTargetRowsUpdated: 22  
numTargetDeletionVectorsAdded: 1
In Case 2 (no Deletion Vectors):  
executionTimeMs: 101,714  
materializeSourceTimeMs: 12,795  
numTargetRowsUpdated: 7  
numTargetRowsCopied: 405,967 (full rewrite)
I expected the DV-enabled merge to be faster, but it turned out to be slower overall. Both cases used the same unpartitioned table.
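
For reference, this is roughly how the two cases can be set up. This is a minimal sketch; `target_tbl`, `source_vw`, and the join key `id` are placeholder names, not the actual ones:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

merge_sql = """
    MERGE INTO target_tbl AS t
    USING source_vw AS s
    ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
"""

# Case 1: deletion vectors enabled on the target table
spark.sql("ALTER TABLE target_tbl "
          "SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'true')")
spark.sql(merge_sql)  # matched rows are masked via a deletion vector

# Case 2: deletion vectors disabled, same merge
spark.sql("ALTER TABLE target_tbl "
          "SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'false')")
spark.sql(merge_sql)  # files containing matched rows are fully rewritten
```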
My questions:
1. Why is the merge with fewer updates and one deletion vector slower than a full rewrite?
2. What factors in DV overhead or source materialization might be contributing to this result?
3. Are there known cases where non-DV merges outperform DV-enabled ones on unpartitioned tables?
Any insights or experiences would be much appreciated!
2 REPLIES

saurabh18cs
Honored Contributor

Hi Pooja,

Let's understand DVs first. They avoid rewriting entire files by marking rows as deleted/updated via a bitmap (the deletion vector), which should, in theory, be faster for small updates.

But DVs introduce new overhead:

1) Writing and updating the DV metadata, and ensuring atomicity, adds I/O cost.

2) When DVs are present, Delta must read the original file and apply the DV mask at read time, which can slow down both the merge and subsequent reads.

3) Additional metadata handling in the transaction log: a DV-enabled merge has to write/update the DV file itself and then commit actions to the Delta transaction log that associate each data file with its deletion vector. You can see this overhead directly in the table history, as in the sketch below.
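
Here is a small sketch for pulling the same operationMetrics you quoted (executionTimeMs, materializeSourceTimeMs, etc.) out of the table history to compare the two merges side by side; `target_tbl` is a placeholder name:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# operationMetrics is a string->string map in DESCRIBE HISTORY output
merges = (
    spark.sql("DESCRIBE HISTORY target_tbl")
    .filter("operation = 'MERGE'")
    .select("version", "operationMetrics")
)

for row in merges.collect():
    m = row["operationMetrics"]
    print(
        f"v{row['version']}: "
        f"exec={m.get('executionTimeMs')}ms, "
        f"materializeSource={m.get('materializeSourceTimeMs')}ms, "
        f"dvsAdded={m.get('numTargetDeletionVectorsAdded', '0')}, "
        f"rowsCopied={m.get('numTargetRowsCopied', '0')}"
    )
```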

So my take is: without partitioning, more data may need to be scanned and more DVs created, increasing the overhead. DVs really pay off when you need them to reduce concurrent-write conflicts, or for large files with small updates (especially on partitioned tables) where a full rewrite would be expensive.

Always choose the strategy best suited to your workload.
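
And if accumulated deletion vectors start hurting read performance, you can rewrite the affected files and drop the DVs; a minimal sketch (`target_tbl` is again a placeholder):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Rewrite files that carry deletion vectors, hard-deleting the masked rows
spark.sql("REORG TABLE target_tbl APPLY (PURGE)")

# VACUUM then removes the now-unreferenced files and DVs after the
# retention window has passed
spark.sql("VACUUM target_tbl")
```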


szymon_dybczak
Esteemed Contributor III

Thanks for such a detailed explanation, @saurabh18cs!