cancel
Showing results forย 
Search instead forย 
Did you mean:ย 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results forย 
Search instead forย 
Did you mean:ย 

Auto OPTIMIZE causing a data discrepancy

sanket-kelkar
New Contributor II

I have a delta table in Azure Databricks that gets MERGEd every 10 minutes.

In the attached screenshot, in the version history of this table, I see a MERGE operation every 10 minutes which is expected. Along with that, I see the OPTIMIZE operation after every 3-4 runs, this article suggests that Databricks does the predictive optimization for the tables, and that is the reason that I see those OPTIMIZE operations.

My question: When I run a simple SELECT statement on this table with some basic WHERE clause, I randomly see data discrepency (700 records than the expected of 2000+ records). To dig dipper into this, I noted the time when I ran the query and got fewer records and then I queried the history of the table for that time. It seems like when it performs the OPTIMIZE it is combining the smaller files to form a bigger file but I am getting caught in the middle and there's a discrepancy which defeats the ACID properties that Databricks offers.

This is the operation metrics of the OPTIMIZE operation:

numRemovedFiles: "2"
numRemovedBytes: "5644151"
p25FileSize: "3234772"
numDeletionVectorsRemoved: "1"
minFileSize: "3234772"
numAddedFiles: "1"
maxFileSize: "3234772"
p75FileSize: "3234772"
p50FileSize: "3234772"
numAddedBytes: "3234772"

and this is the operation metrics of the MERGE operation that ran 10 seconds earlier:

numTargetRowsCopied: "0"
numTargetRowsDeleted: "0"
numTargetFilesAdded: "1"
numTargetBytesAdded: "2409613"
numTargetBytesRemoved: "0"
numTargetDeletionVectorsAdded: "1"
numTargetRowsMatchedUpdated: "20380"
executionTimeMs: "22332"
materializeSourceTimeMs: "5886"
numTargetRowsInserted: "6"
numTargetRowsMatchedDeleted: "0"
numTargetDeletionVectorsUpdated: "0"
scanTimeMs: "1781"
numTargetRowsUpdated: "20380"
numOutputRows: "20386"
numTargetDeletionVectorsRemoved: "0"
numTargetRowsNotMatchedBySourceUpdated: "0"
numTargetChangeFilesAdded: "1"
numSourceRows: "20386"
numTargetFilesRemoved: "0"
numTargetRowsNotMatchedBySourceDeleted: "0"
rewriteTimeMs: "14510"

What could be the reason for this data discrepency and is there any setting that I am missing in these operations?

1 REPLY 1

VZLA
Databricks Employee
Databricks Employee

Can you please provide more context about this, specifically with respect to the DBR Release and reproducibility of this scenario?

Any metrics or plan change differences between both select statements, while the Optimize was in progress and after?

This can be better triaged via a support ticket.

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you wonโ€™t want to miss the chance to attend and share knowledge.

If there isnโ€™t a group near you, start one and help create a community that brings people together.

Request a New Group