I have a delta table in Azure Databricks that gets MERGEd every 10 minutes.
In the attached screenshot, in the version history of this table, I see a MERGE operation every 10 minutes which is expected. Along with that, I see the OPTIMIZE operation after every 3-4 runs, this article suggests that Databricks does the predictive optimization for the tables, and that is the reason that I see those OPTIMIZE operations.
My question: When I run a simple SELECT statement on this table with some basic WHERE clause, I randomly see data discrepency (700 records than the expected of 2000+ records). To dig dipper into this, I noted the time when I ran the query and got fewer records and then I queried the history of the table for that time. It seems like when it performs the OPTIMIZE it is combining the smaller files to form a bigger file but I am getting caught in the middle and there's a discrepancy which defeats the ACID properties that Databricks offers.
This is the operation metrics of the OPTIMIZE operation:
numRemovedFiles: "2"
numRemovedBytes: "5644151"
p25FileSize: "3234772"
numDeletionVectorsRemoved: "1"
minFileSize: "3234772"
numAddedFiles: "1"
maxFileSize: "3234772"
p75FileSize: "3234772"
p50FileSize: "3234772"
numAddedBytes: "3234772"
and this is the operation metrics of the MERGE operation that ran 10 seconds earlier:
numTargetRowsCopied: "0"
numTargetRowsDeleted: "0"
numTargetFilesAdded: "1"
numTargetBytesAdded: "2409613"
numTargetBytesRemoved: "0"
numTargetDeletionVectorsAdded: "1"
numTargetRowsMatchedUpdated: "20380"
executionTimeMs: "22332"
materializeSourceTimeMs: "5886"
numTargetRowsInserted: "6"
numTargetRowsMatchedDeleted: "0"
numTargetDeletionVectorsUpdated: "0"
scanTimeMs: "1781"
numTargetRowsUpdated: "20380"
numOutputRows: "20386"
numTargetDeletionVectorsRemoved: "0"
numTargetRowsNotMatchedBySourceUpdated: "0"
numTargetChangeFilesAdded: "1"
numSourceRows: "20386"
numTargetFilesRemoved: "0"
numTargetRowsNotMatchedBySourceDeleted: "0"
rewriteTimeMs: "14510"
What could be the reason for this data discrepency and is there any setting that I am missing in these operations?