Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Why is Merge with Deletion Vectors Slower Than Full File Rewrite on the Same Table?

pooja_bhumandla
New Contributor II
I've run two MERGE INTO operations on the same Delta table: one with Deletion Vectors enabled (Case 1), and one without (Case 2).
In Case 1 (with Deletion Vectors):  
executionTimeMs: 106,708  
materializeSourceTimeMs: 24,344 
numTargetRowsUpdated: 22  
numTargetDeletionVectorsAdded: 1
In Case 2 (no Deletion Vectors):  
executionTimeMs: 101,714  
materializeSourceTimeMs: 12,795  
numTargetRowsUpdated: 7  
numTargetRowsCopied: 405,967 (full rewrite)
I expected the DV-enabled merge to be faster, but it turned out to be slower overall. Both cases used the same unpartitioned table.
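
For reference, this is roughly how the two cases can be set up. This is a minimal sketch; `target_tbl`, `source_vw`, and the join key `id` are placeholder names, not the actual ones:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

merge_sql = """
    MERGE INTO target_tbl AS t
    USING source_vw AS s
    ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
"""

# Case 1: deletion vectors enabled on the target table
spark.sql("ALTER TABLE target_tbl "
          "SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'true')")
spark.sql(merge_sql)  # matched rows are masked via a deletion vector

# Case 2: deletion vectors disabled, same merge
spark.sql("ALTER TABLE target_tbl "
          "SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'false')")
spark.sql(merge_sql)  # files containing matched rows are fully rewritten
```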
My questions:
1. Why is the merge with fewer updates and one deletion vector slower than a full rewrite?
2. What factors in DV overhead or source materialization might be contributing to this result?
3. Are there known cases where non-DV merges outperform DV-enabled ones on unpartitioned tables?
Any insights or experiences would be much appreciated!
2 REPLIES

saurabh18cs
Honored Contributor

Hi Pooja,

Let's understand DVs first. They avoid rewriting entire files by marking rows as deleted/updated via a bitmap (the deletion vector), which should, in theory, be faster for small updates.

But DVs introduce new overhead:

1) Writing and updating the DV metadata, and ensuring atomicity, adds I/O cost.

2) When DVs are present, Delta must read the original file and apply the DV mask at read time, which can slow down both the merge and subsequent reads.

3) Additional metadata handling in the transaction log: a DV-enabled merge has to write/update the DV file itself and then commit actions to the Delta transaction log that associate each data file with its deletion vector. You can see this overhead directly in the table history, as in the sketch below.
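
Here is a small sketch for pulling the same operationMetrics you quoted (executionTimeMs, materializeSourceTimeMs, etc.) out of the table history to compare the two merges side by side; `target_tbl` is a placeholder name:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# operationMetrics is a string->string map in DESCRIBE HISTORY output
merges = (
    spark.sql("DESCRIBE HISTORY target_tbl")
    .filter("operation = 'MERGE'")
    .select("version", "operationMetrics")
)

for row in merges.collect():
    m = row["operationMetrics"]
    print(
        f"v{row['version']}: "
        f"exec={m.get('executionTimeMs')}ms, "
        f"materializeSource={m.get('materializeSourceTimeMs')}ms, "
        f"dvsAdded={m.get('numTargetDeletionVectorsAdded', '0')}, "
        f"rowsCopied={m.get('numTargetRowsCopied', '0')}"
    )
```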

So my take is: without partitioning, more data may need to be scanned and more DVs created, increasing the overhead. DVs really pay off when you need them to reduce concurrent-write conflicts, or for large files with small updates (especially on partitioned tables) where a full rewrite would be expensive.

Always choose the strategy best suited to your workload.
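
And if accumulated deletion vectors start hurting read performance, you can rewrite the affected files and drop the DVs; a minimal sketch (`target_tbl` is again a placeholder):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Rewrite files that carry deletion vectors, hard-deleting the masked rows
spark.sql("REORG TABLE target_tbl APPLY (PURGE)")

# VACUUM then removes the now-unreferenced files and DVs after the
# retention window has passed
spark.sql("VACUUM target_tbl")
```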


szymon_dybczak
Esteemed Contributor III

Thanks for such a detailed explanation, @saurabh18cs!