<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Auto OPTIMIZE causing a data discrepancy in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/auto-optimize-causing-a-data-discrepancy/m-p/96387#M39270</link>
    <description>&lt;P&gt;I have a delta table in Azure Databricks that gets&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;MERGE&lt;/STRONG&gt;d every 10 minutes.&lt;/P&gt;&lt;P&gt;In the attached screenshot,&amp;nbsp;in the version history of this table, I see a &lt;STRONG&gt;MERGE&lt;/STRONG&gt; operation every 10 minutes, which is expected. Along with that, I see an &lt;STRONG&gt;OPTIMIZE&lt;/STRONG&gt; operation after every 3-4 runs; &lt;A href="https://learn.microsoft.com/en-us/azure/databricks/optimizations/predictive-optimization" target="_self"&gt;this article&lt;/A&gt; suggests that Databricks performs &lt;STRONG&gt;predictive optimization&lt;/STRONG&gt; on tables, which is why I see those &lt;STRONG&gt;OPTIMIZE&lt;/STRONG&gt; operations.&lt;/P&gt;&lt;P&gt;My question: When I run a simple SELECT statement on this table with a basic WHERE clause, I randomly see a data discrepancy (700 records instead of the expected 2,000+ records). To dig deeper into this, I noted the time when I ran the query and got fewer records, and then queried the history of the table for that time. 
It seems like when it performs the OPTIMIZE it is combining the smaller files to form a bigger file but I am getting caught in the middle and there's a discrepancy which defeats the ACID properties that Databricks offers.&lt;/P&gt;&lt;P&gt;This is the operation metrics of the OPTIMIZE operation:&lt;/P&gt;&lt;P&gt;&lt;EM&gt;numRemovedFiles: "2"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;numRemovedBytes: "5644151"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;p25FileSize: "3234772"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;numDeletionVectorsRemoved: "1"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;minFileSize: "3234772"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;numAddedFiles: "1"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;maxFileSize: "3234772"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;p75FileSize: "3234772"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;p50FileSize: "3234772"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;numAddedBytes: "3234772"&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;and this is the operation metrics of the MERGE operation that ran 10 seconds earlier:&lt;/P&gt;&lt;P&gt;&lt;EM&gt;numTargetRowsCopied: "0"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;numTargetRowsDeleted: "0"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;numTargetFilesAdded: "1"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;numTargetBytesAdded: "2409613"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;numTargetBytesRemoved: "0"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;numTargetDeletionVectorsAdded: "1"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;numTargetRowsMatchedUpdated: "20380"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;executionTimeMs: "22332"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;materializeSourceTimeMs: "5886"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;numTargetRowsInserted: "6"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;numTargetRowsMatchedDeleted: "0"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;numTargetDeletionVectorsUpdated: "0"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;scanTimeMs: "1781"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;numTargetRowsUpdated: "20380"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;numOutputRows: "20386"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;numTargetDeletionVectorsRemoved: "0"&lt;/EM&gt;&lt;BR 
/&gt;&lt;EM&gt;numTargetRowsNotMatchedBySourceUpdated: "0"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;numTargetChangeFilesAdded: "1"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;numSourceRows: "20386"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;numTargetFilesRemoved: "0"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;numTargetRowsNotMatchedBySourceDeleted: "0"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;rewriteTimeMs: "14510"&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;What could be the reason for this data discrepancy, and is there a setting that I am missing in these operations?&lt;/P&gt;</description>
    <pubDate>Mon, 28 Oct 2024 07:16:48 GMT</pubDate>
    <dc:creator>sanket-kelkar</dc:creator>
    <dc:date>2024-10-28T07:16:48Z</dc:date>
    <item>
      <title>Auto OPTIMIZE causing a data discrepancy</title>
      <link>https://community.databricks.com/t5/data-engineering/auto-optimize-causing-a-data-discrepancy/m-p/96387#M39270</link>
      <description>&lt;P&gt;I have a delta table in Azure Databricks that gets&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;MERGE&lt;/STRONG&gt;d every 10 minutes.&lt;/P&gt;&lt;P&gt;In the attached screenshot,&amp;nbsp;in the version history of this table, I see a &lt;STRONG&gt;MERGE&lt;/STRONG&gt; operation every 10 minutes, which is expected. Along with that, I see an &lt;STRONG&gt;OPTIMIZE&lt;/STRONG&gt; operation after every 3-4 runs; &lt;A href="https://learn.microsoft.com/en-us/azure/databricks/optimizations/predictive-optimization" target="_self"&gt;this article&lt;/A&gt; suggests that Databricks performs &lt;STRONG&gt;predictive optimization&lt;/STRONG&gt; on tables, which is why I see those &lt;STRONG&gt;OPTIMIZE&lt;/STRONG&gt; operations.&lt;/P&gt;&lt;P&gt;My question: When I run a simple SELECT statement on this table with a basic WHERE clause, I randomly see a data discrepancy (700 records instead of the expected 2,000+ records). To dig deeper into this, I noted the time when I ran the query and got fewer records, and then queried the history of the table for that time. 
It seems like when it performs the OPTIMIZE it is combining the smaller files to form a bigger file but I am getting caught in the middle and there's a discrepancy which defeats the ACID properties that Databricks offers.&lt;/P&gt;&lt;P&gt;This is the operation metrics of the OPTIMIZE operation:&lt;/P&gt;&lt;P&gt;&lt;EM&gt;numRemovedFiles: "2"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;numRemovedBytes: "5644151"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;p25FileSize: "3234772"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;numDeletionVectorsRemoved: "1"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;minFileSize: "3234772"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;numAddedFiles: "1"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;maxFileSize: "3234772"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;p75FileSize: "3234772"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;p50FileSize: "3234772"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;numAddedBytes: "3234772"&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;and this is the operation metrics of the MERGE operation that ran 10 seconds earlier:&lt;/P&gt;&lt;P&gt;&lt;EM&gt;numTargetRowsCopied: "0"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;numTargetRowsDeleted: "0"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;numTargetFilesAdded: "1"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;numTargetBytesAdded: "2409613"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;numTargetBytesRemoved: "0"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;numTargetDeletionVectorsAdded: "1"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;numTargetRowsMatchedUpdated: "20380"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;executionTimeMs: "22332"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;materializeSourceTimeMs: "5886"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;numTargetRowsInserted: "6"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;numTargetRowsMatchedDeleted: "0"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;numTargetDeletionVectorsUpdated: "0"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;scanTimeMs: "1781"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;numTargetRowsUpdated: "20380"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;numOutputRows: "20386"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;numTargetDeletionVectorsRemoved: "0"&lt;/EM&gt;&lt;BR 
/&gt;&lt;EM&gt;numTargetRowsNotMatchedBySourceUpdated: "0"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;numTargetChangeFilesAdded: "1"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;numSourceRows: "20386"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;numTargetFilesRemoved: "0"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;numTargetRowsNotMatchedBySourceDeleted: "0"&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;rewriteTimeMs: "14510"&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;What could be the reason for this data discrepancy, and is there a setting that I am missing in these operations?&lt;/P&gt;</description>
      <pubDate>Mon, 28 Oct 2024 07:16:48 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/auto-optimize-causing-a-data-discrepancy/m-p/96387#M39270</guid>
      <dc:creator>sanket-kelkar</dc:creator>
      <dc:date>2024-10-28T07:16:48Z</dc:date>
    </item>
    <item>
      <title>Re: Auto OPTIMIZE causing a data discrepancy</title>
      <link>https://community.databricks.com/t5/data-engineering/auto-optimize-causing-a-data-discrepancy/m-p/97041#M39403</link>
      <description>&lt;P&gt;Can you please provide more context, specifically the DBR release and how reproducible this scenario is?&lt;/P&gt;
&lt;P&gt;Are there any metric or query plan differences between the two SELECT statements, one run while the OPTIMIZE was in progress and one run after it completed?&lt;/P&gt;
&lt;P&gt;This can be better triaged via a support ticket.&lt;/P&gt;</description>
      <pubDate>Thu, 31 Oct 2024 17:01:41 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/auto-optimize-causing-a-data-discrepancy/m-p/97041#M39403</guid>
      <dc:creator>VZLA</dc:creator>
      <dc:date>2024-10-31T17:01:41Z</dc:date>
    </item>
  </channel>
</rss>