<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Why is Merge with Deletion Vectors Slower Than Full File Rewrite on the Same Table? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/why-is-merge-with-deletion-vectors-slower-than-full-file-rewrite/m-p/123246#M46965</link>
    <description>&lt;DIV&gt;I've run two MERGE INTO operations on the same Delta table—one with Deletion Vectors enabled (Case 1), and one without (Case 2).&lt;/DIV&gt;&lt;DIV&gt;In Case 1 (with Deletion Vectors):&amp;nbsp;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;executionTimeMs: 106,708&amp;nbsp;&amp;nbsp;&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;materializeSourceTimeMs: 24,344&amp;nbsp;&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;numTargetRowsUpdated: 22&amp;nbsp;&amp;nbsp;&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;numTargetDeletionVectorsAdded: 1&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;In Case 2 (no Deletion Vectors):&amp;nbsp;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;executionTimeMs: 101,714&amp;nbsp;&amp;nbsp;&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;materializeSourceTimeMs: 12,795&amp;nbsp;&amp;nbsp;&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;numTargetRowsUpdated: 7&amp;nbsp;&amp;nbsp;&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;numTargetRowsCopied: 405,967 (full rewrite) &lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;I expected the DV-enabled merge to be faster, but it turned out to be slower overall. Both cases used the same unpartitioned table.&lt;/DIV&gt;&lt;DIV&gt;My questions:&lt;/DIV&gt;&lt;DIV&gt;1. Why is the merge with fewer updates and one deletion vector slower than a full rewrite?&lt;/DIV&gt;&lt;DIV&gt;2. What factors in DV overhead or source materialization might be contributing to this result?&lt;/DIV&gt;&lt;DIV&gt;3. Are there known cases where non-DV merges outperform DV-enabled ones on unpartitioned tables?&lt;/DIV&gt;&lt;DIV&gt;Any insights or experiences would be much appreciated&lt;/DIV&gt;</description>
    <pubDate>Mon, 30 Jun 2025 08:24:52 GMT</pubDate>
    <dc:creator>pooja_bhumandla</dc:creator>
    <dc:date>2025-06-30T08:24:52Z</dc:date>
    <item>
      <title>Why is Merge with Deletion Vectors Slower Than Full File Rewrite on the Same Table?</title>
      <link>https://community.databricks.com/t5/data-engineering/why-is-merge-with-deletion-vectors-slower-than-full-file-rewrite/m-p/123246#M46965</link>
      <description>&lt;DIV&gt;I've run two MERGE INTO operations on the same Delta table—one with Deletion Vectors enabled (Case 1), and one without (Case 2).&lt;/DIV&gt;&lt;DIV&gt;In Case 1 (with Deletion Vectors):&amp;nbsp;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;executionTimeMs: 106,708&amp;nbsp;&amp;nbsp;&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;materializeSourceTimeMs: 24,344&amp;nbsp;&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;numTargetRowsUpdated: 22&amp;nbsp;&amp;nbsp;&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;numTargetDeletionVectorsAdded: 1&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;In Case 2 (no Deletion Vectors):&amp;nbsp;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;executionTimeMs: 101,714&amp;nbsp;&amp;nbsp;&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;materializeSourceTimeMs: 12,795&amp;nbsp;&amp;nbsp;&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;numTargetRowsUpdated: 7&amp;nbsp;&amp;nbsp;&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;numTargetRowsCopied: 405,967 (full rewrite) &lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;I expected the DV-enabled merge to be faster, but it turned out to be slower overall. Both cases used the same unpartitioned table.&lt;/DIV&gt;&lt;DIV&gt;My questions:&lt;/DIV&gt;&lt;DIV&gt;1. Why is the merge with fewer updates and one deletion vector slower than a full rewrite?&lt;/DIV&gt;&lt;DIV&gt;2. What factors in DV overhead or source materialization might be contributing to this result?&lt;/DIV&gt;&lt;DIV&gt;3. Are there known cases where non-DV merges outperform DV-enabled ones on unpartitioned tables?&lt;/DIV&gt;&lt;DIV&gt;Any insights or experiences would be much appreciated&lt;/DIV&gt;</description>
      <pubDate>Mon, 30 Jun 2025 08:24:52 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/why-is-merge-with-deletion-vectors-slower-than-full-file-rewrite/m-p/123246#M46965</guid>
      <dc:creator>pooja_bhumandla</dc:creator>
      <dc:date>2025-06-30T08:24:52Z</dc:date>
    </item>
    <item>
      <title>Re: Why is Merge with Deletion Vectors Slower Than Full File Rewrite on the Same Table?</title>
      <link>https://community.databricks.com/t5/data-engineering/why-is-merge-with-deletion-vectors-slower-than-full-file-rewrite/m-p/123255#M46966</link>
      <description>&lt;P&gt;Hi Pooja,&lt;/P&gt;&lt;P&gt;Let's first look at what deletion vectors (DVs) do: they avoid rewriting entire files by marking rows as deleted or updated in a bitmap (the deletion vector), which should, in theory, be faster for small updates.&lt;/P&gt;&lt;P&gt;But DVs introduce new overhead:&lt;/P&gt;&lt;P&gt;1)&amp;nbsp;&lt;SPAN&gt;Writing and updating the DV files and their metadata, and ensuring atomicity, adds I/O cost.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;2)&amp;nbsp;When DVs are present, Delta must read the original file and apply the DV mask at read time, which can slow down both the merge and subsequent reads.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;3) Additional metadata handling in the transaction log. A DV-based merge has to write or update the DV file and record it in the Delta transaction log; the Parquet data files themselves are immutable and are not modified in place.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;So my take is:&amp;nbsp;&lt;/SPAN&gt;without partitioning, more data may need to be scanned and more DVs created, which increases overhead. DVs may not pay off here unless you really need them to avoid failures from concurrent writes. They can also be useful on partitioned tables with large files and small updates, where rewriting would be expensive.&lt;/P&gt;&lt;P&gt;Always choose the strategy that suits your workload.&lt;/P&gt;</description>
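The metrics quoted in the question can be broken down a little further. The sketch below is only arithmetic on the numbers posted in the thread, and it assumes that materializeSourceTimeMs is counted inside executionTimeMs, which the thread does not confirm. The helper name time_excluding_source is hypothetical.

```python
# Operation metrics copied from the two MERGE runs in the question (milliseconds).
case_dv = {"executionTimeMs": 106_708, "materializeSourceTimeMs": 24_344}
case_no_dv = {"executionTimeMs": 101_714, "materializeSourceTimeMs": 12_795}


def time_excluding_source(metrics):
    """Execution time minus source materialization time.

    Assumption: materializeSourceTimeMs is included in executionTimeMs.
    """
    return metrics["executionTimeMs"] - metrics["materializeSourceTimeMs"]


# Differences are computed as (DV run) minus (non-DV run).
total_gap = case_dv["executionTimeMs"] - case_no_dv["executionTimeMs"]
source_gap = case_dv["materializeSourceTimeMs"] - case_no_dv["materializeSourceTimeMs"]
dv_rest = time_excluding_source(case_dv)
no_dv_rest = time_excluding_source(case_no_dv)

print(f"total gap: {total_gap} ms")
print(f"source materialization gap: {source_gap} ms")
print(f"DV run excluding source: {dv_rest} ms")
print(f"non-DV run excluding source: {no_dv_rest} ms")
```

Under that assumption, the gap in source materialization time is larger than the gap in total time, so the remaining work in the DV run was not slower than in the non-DV run. With a single run per case, differences of this size may also be ordinary run-to-run variation.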
      <pubDate>Mon, 30 Jun 2025 09:25:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/why-is-merge-with-deletion-vectors-slower-than-full-file-rewrite/m-p/123255#M46966</guid>
      <dc:creator>saurabh18cs</dc:creator>
      <dc:date>2025-06-30T09:25:34Z</dc:date>
    </item>
    <item>
      <title>Re: Why is Merge with Deletion Vectors Slower Than Full File Rewrite on the Same Table?</title>
      <link>https://community.databricks.com/t5/data-engineering/why-is-merge-with-deletion-vectors-slower-than-full-file-rewrite/m-p/123260#M46968</link>
      <description>&lt;P&gt;Thanks for such a detailed explanation,&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/22314"&gt;@saurabh18cs&lt;/a&gt;!&lt;/P&gt;</description>
      <pubDate>Mon, 30 Jun 2025 09:59:28 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/why-is-merge-with-deletion-vectors-slower-than-full-file-rewrite/m-p/123260#M46968</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2025-06-30T09:59:28Z</dc:date>
    </item>
  </channel>
</rss>

