<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Performance Behavior of MERGE with Partitioned Table: Impact of ZORDER and Deletion Vectors in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/performance-behavior-of-merge-with-partitioned-table-impact-of/m-p/126448#M47696</link>
    <description>&lt;P&gt;Hi Databricks Community,&lt;/P&gt;&lt;P&gt;I’m analyzing the performance of Delta Lake MERGE operations on a partitioned table, and I observed unexpected behavior across 3 test cases.&lt;/P&gt;&lt;P&gt;I wanted to share my findings to better understand:&lt;/P&gt;&lt;P&gt;Why ZORDER or Deletion Vectors helped or hurt performance&lt;BR /&gt;What factors explain the execution time behavior&lt;BR /&gt;What optimizations might improve performance in a partitioned table setup&lt;/P&gt;&lt;DIV&gt;&lt;STRONG&gt;Table Setup:&lt;/STRONG&gt;&lt;/DIV&gt;&lt;UL&gt;&lt;LI&gt;&lt;SPAN&gt;Table is partitioned&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;SPAN&gt;Merge condition targets only 17 specific rows&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;SPAN&gt;All merges are on the same table — only the optimization strategy changes&lt;/SPAN&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;DIV&gt;&lt;STRONG&gt;Case 1&lt;/STRONG&gt;: Merge After Optimizing (No ZORDER, No Deletion Vectors)&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;Metric Value&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;numTargetRowsCopied 5.9 million&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;executionTimeMs 194,757 ms&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;rewriteTimeMs 51,049 ms&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;scanTimeMs 7,974 ms&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;numTargetFilesAdded 1&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;Observation: Reasonable performance but still rewrites a large portion of data even though only 17 rows were matched.&lt;BR /&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&lt;STRONG&gt;Case 2&lt;/STRONG&gt;: Merge After ZORDER on 4 Columns (No Deletion Vectors)&lt;BR /&gt;Metric Value&lt;BR /&gt;numTargetRowsCopied 30.3 million&lt;BR /&gt;executionTimeMs 202,994 ms&lt;BR /&gt;rewriteTimeMs 113,906 ms&lt;BR /&gt;scanTimeMs 11,715 ms&lt;BR /&gt;numTargetFilesAdded 7&lt;BR /&gt;&lt;STRONG&gt;Observation&lt;/STRONG&gt;: Worse performance than Case 1 — even though ZORDER was applied, more rows were rewritten&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&lt;STRONG&gt;Case 3&lt;/STRONG&gt;: Merge After ZORDER + Deletion Vectors Enabled&lt;BR /&gt;Metric Value&lt;BR /&gt;numTargetRowsCopied 15 million&lt;BR /&gt;executionTimeMs 264,358 ms&lt;BR /&gt;rewriteTimeMs 122,688 ms&lt;BR /&gt;scanTimeMs 19,342 ms&lt;BR /&gt;numTargetFilesAdded 1&lt;BR /&gt;&lt;STRONG&gt;Observation&lt;/STRONG&gt;: Despite enabling Deletion Vectors, which should help by avoiding full rewrites, the execution time actually increased.&lt;BR /&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/DIV&gt;&lt;OL&gt;&lt;LI&gt;&lt;SPAN&gt;Why did enabling Deletion Vectors result in higher execution time, even though fewer files were touched and fewer rows were copied than in Case 2?&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;SPAN&gt;Why does ZORDERing seem to increase the number of rows copied in Case 2, even though it improves scan locality?&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;SPAN&gt;What are the best practices to ensure that ZORDER and Deletion Vectors actually improve performance, especially when updating a very small number of rows?&lt;BR /&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;Would love to hear your insights and suggestions!&lt;/P&gt;&lt;P&gt;Thanks in advance!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Fri, 25 Jul 2025 12:04:52 GMT</pubDate>
    <dc:creator>pooja_bhumandla</dc:creator>
    <dc:date>2025-07-25T12:04:52Z</dc:date>
    <item>
      <title>Performance Behavior of MERGE with Partitioned Table: Impact of ZORDER and Deletion Vectors</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-behavior-of-merge-with-partitioned-table-impact-of/m-p/126448#M47696</link>
      <description>&lt;P&gt;Hi Databricks Community,&lt;/P&gt;&lt;P&gt;I’m analyzing the performance of Delta Lake MERGE operations on a partitioned table, and I observed unexpected behavior across 3 test cases.&lt;/P&gt;&lt;P&gt;I wanted to share my findings to better understand:&lt;/P&gt;&lt;P&gt;Why ZORDER or Deletion Vectors helped or hurt performance&lt;BR /&gt;What factors explain the execution time behavior&lt;BR /&gt;What optimizations might improve performance in a partitioned table setup&lt;/P&gt;&lt;DIV&gt;&lt;STRONG&gt;Table Setup:&lt;/STRONG&gt;&lt;/DIV&gt;&lt;UL&gt;&lt;LI&gt;&lt;SPAN&gt;Table is partitioned&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;SPAN&gt;Merge condition targets only 17 specific rows&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;SPAN&gt;All merges are on the same table — only the optimization strategy changes&lt;/SPAN&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;DIV&gt;&lt;STRONG&gt;Case 1&lt;/STRONG&gt;: Merge After Optimizing (No ZORDER, No Deletion Vectors)&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;Metric Value&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;numTargetRowsCopied 5.9 million&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;executionTimeMs 194,757 ms&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;rewriteTimeMs 51,049 ms&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;scanTimeMs 7,974 ms&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;numTargetFilesAdded 1&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;Observation: Reasonable performance but still rewrites a large portion of data even though only 17 rows were matched.&lt;BR /&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&lt;STRONG&gt;Case 2&lt;/STRONG&gt;: Merge After ZORDER on 4 Columns (No Deletion Vectors)&lt;BR /&gt;Metric Value&lt;BR /&gt;numTargetRowsCopied 30.3 million&lt;BR /&gt;executionTimeMs 202,994 ms&lt;BR /&gt;rewriteTimeMs 113,906 ms&lt;BR /&gt;scanTimeMs 11,715 ms&lt;BR /&gt;numTargetFilesAdded 7&lt;BR /&gt;&lt;STRONG&gt;Observation&lt;/STRONG&gt;: Worse performance than Case 1 — even though ZORDER was applied, more rows were rewritten&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&lt;STRONG&gt;Case 3&lt;/STRONG&gt;: Merge After ZORDER + Deletion Vectors Enabled&lt;BR /&gt;Metric Value&lt;BR /&gt;numTargetRowsCopied 15 million&lt;BR /&gt;executionTimeMs 264,358 ms&lt;BR /&gt;rewriteTimeMs 122,688 ms&lt;BR /&gt;scanTimeMs 19,342 ms&lt;BR /&gt;numTargetFilesAdded 1&lt;BR /&gt;&lt;STRONG&gt;Observation&lt;/STRONG&gt;: Despite enabling Deletion Vectors, which should help by avoiding full rewrites, the execution time actually increased.&lt;BR /&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/DIV&gt;&lt;OL&gt;&lt;LI&gt;&lt;SPAN&gt;Why did enabling Deletion Vectors result in higher execution time, even though fewer files were touched and fewer rows were copied than in Case 2?&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;SPAN&gt;Why does ZORDERing seem to increase the number of rows copied in Case 2, even though it improves scan locality?&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;SPAN&gt;What are the best practices to ensure that ZORDER and Deletion Vectors actually improve performance, especially when updating a very small number of rows?&lt;BR /&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;Would love to hear your insights and suggestions!&lt;/P&gt;&lt;P&gt;Thanks in advance!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 25 Jul 2025 12:04:52 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-behavior-of-merge-with-partitioned-table-impact-of/m-p/126448#M47696</guid>
      <dc:creator>pooja_bhumandla</dc:creator>
      <dc:date>2025-07-25T12:04:52Z</dc:date>
    </item>
    <item>
      <title>Re: Performance Behavior of MERGE with Partitioned Table: Impact of ZORDER and Deletion Vectors</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-behavior-of-merge-with-partitioned-table-impact-of/m-p/127103#M47856</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/170125"&gt;@pooja_bhumandla&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks for such a nice and detailed description of Your case, that really helps to understand the scenario &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Regarding Your questions:&lt;/P&gt;&lt;P&gt;1)&lt;/P&gt;&lt;P class=""&gt;Overall operation could become more complex due to:&lt;/P&gt;&lt;P class=""&gt;a) deletion vector creation and maintenance,&lt;/P&gt;&lt;P class=""&gt;b) metadata operations,&lt;/P&gt;&lt;P class=""&gt;c) any background optimization processes.&lt;/P&gt;&lt;P class=""&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;2)&lt;/P&gt;&lt;P&gt;ZORDER operation is phisically redistributing Your rows across new files (creating new delta table version) created by OPTIMIZE command (thats why we can see 7 files rewritten in case 2) in a given order - this is why your scan operations are faster (improved reading capabilities on zorder-ed table).&lt;/P&gt;&lt;P&gt;Merge operation had to rewrite more files and more data so it took longer and downgraded overall performance of Your operation (your rows got redistributed across multiple parquet files).&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;3)&lt;/P&gt;&lt;P&gt;I would recommend performing below checks / steps:&lt;/P&gt;&lt;P&gt;a) ZORDER keys are in line with merge patterns (keys),&lt;/P&gt;&lt;P&gt;b) use smaller target file sizes (using&amp;nbsp;&lt;SPAN&gt;tuneFileSizesForRewrites or&amp;nbsp;maxFileSize) - data layout under your delta table can have huge impact on merge performance (ref to this topic:&amp;nbsp;&lt;A href="https://community.databricks.com/t5/data-engineering/delta-lake-file-sizes-optimize-maxfilesize/td-p/127095" target="_self"&gt;delta-lake-file-sizes-optimize-maxfilesize)&lt;/A&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;c) consider running a test with liquid clustering or different partition alignment wich will fit your merge process,&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;d) test on hive-style partitioned and optimized table with no zorder and different maxFileSize (32 MB, 64MB, 128 MB - depends of your table size) or&amp;nbsp;tuneFileSizesForRewrites set to true,&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;e) test and adapt what works best for Your specific scenario.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;Best,&lt;/P&gt;&lt;P&gt;Radek.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 31 Jul 2025 14:36:39 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-behavior-of-merge-with-partitioned-table-impact-of/m-p/127103#M47856</guid>
      <dc:creator>radothede</dc:creator>
      <dc:date>2025-07-31T14:36:39Z</dc:date>
    </item>
  </channel>
</rss>

