Auto OPTIMIZE causing a data discrepancy
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
10-28-2024 12:13 AM - edited 10-28-2024 12:16 AM
I have a delta table in Azure Databricks that gets MERGEd every 10 minutes.
In the attached screenshot, in the version history of this table, I see a MERGE operation every 10 minutes which is expected. Along with that, I see the OPTIMIZE operation after every 3-4 runs, this article suggests that Databricks does the predictive optimization for the tables, and that is the reason that I see those OPTIMIZE operations.
My question: When I run a simple SELECT statement on this table with some basic WHERE clause, I randomly see data discrepency (700 records than the expected of 2000+ records). To dig dipper into this, I noted the time when I ran the query and got fewer records and then I queried the history of the table for that time. It seems like when it performs the OPTIMIZE it is combining the smaller files to form a bigger file but I am getting caught in the middle and there's a discrepancy which defeats the ACID properties that Databricks offers.
This is the operation metrics of the OPTIMIZE operation:
numRemovedFiles: "2"
numRemovedBytes: "5644151"
p25FileSize: "3234772"
numDeletionVectorsRemoved: "1"
minFileSize: "3234772"
numAddedFiles: "1"
maxFileSize: "3234772"
p75FileSize: "3234772"
p50FileSize: "3234772"
numAddedBytes: "3234772"
and this is the operation metrics of the MERGE operation that ran 10 seconds earlier:
numTargetRowsCopied: "0"
numTargetRowsDeleted: "0"
numTargetFilesAdded: "1"
numTargetBytesAdded: "2409613"
numTargetBytesRemoved: "0"
numTargetDeletionVectorsAdded: "1"
numTargetRowsMatchedUpdated: "20380"
executionTimeMs: "22332"
materializeSourceTimeMs: "5886"
numTargetRowsInserted: "6"
numTargetRowsMatchedDeleted: "0"
numTargetDeletionVectorsUpdated: "0"
scanTimeMs: "1781"
numTargetRowsUpdated: "20380"
numOutputRows: "20386"
numTargetDeletionVectorsRemoved: "0"
numTargetRowsNotMatchedBySourceUpdated: "0"
numTargetChangeFilesAdded: "1"
numSourceRows: "20386"
numTargetFilesRemoved: "0"
numTargetRowsNotMatchedBySourceDeleted: "0"
rewriteTimeMs: "14510"
What could be the reason for this data discrepency and is there any setting that I am missing in these operations?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
10-31-2024 10:01 AM
Can you please provide more context about this, specifically with respect to the DBR Release and reproducibility of this scenario?
Any metrics or plan change differences between both select statements, while the Optimize was in progress and after?
This can be better triaged via a support ticket.