Data Engineering

"Detected a data update", what changed?

tom_shaffner
New Contributor III

In streaming flows I periodically get a "Detected a data update" error. This error generally seems to indicate that something has changed in the source table schema, but it's not immediately apparent what. In one case yesterday I pulled the source table and the destination table from the flow (which had worked the day prior) and couldn't see any differences in the schemas. It's possible I missed something and need to write a programmatic comparison, but needing to do so suggests a problem with the error message itself.

If I delete and recreate the tables and checkpoints, the error goes away, but I can only do that so easily because I'm still in dev. Is there a way, particularly in a case where "wipe and restart" isn't such a great option, to see what about the data update triggered the error? Or a way to make the error message mention what changed?

2 REPLIES

Kaniz
Community Manager

Hi @Tom Shaffner​ , a similar issue on Stack Overflow states:

The read stream will throw an exception if there are updates or deletes in your Delta source. This is also clear from the Databricks documentation:

Structured Streaming does not handle input that is not an append and throws an exception if any modifications occur on the table used as a source.

If you set the option `ignoreChanges` to `true`, it will not throw an exception, but it will give you the updated rows plus rows that may already have been processed.

This is because everything in a Delta table happens at the file level.

For example, if you update a single row in a file (roughly), the following will happen:

  1. Find and read the file which contains the record to be updated.
  2. Write a new file that contains the updated record plus all the other data that was also in the old file.
  3. Mark the old file as removed and the new file as added in the transaction log.
  4. Your read stream will read the whole new file as 'new' records. This means you can get duplicates in your stream.
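The copy-on-write behavior above can be illustrated with a small pure-Python sketch (a simplified model of the mechanism, not actual Delta code): updating one row rewrites its whole file, and a reader that only tracks added files then sees every row in the rewritten file as new.

```python
# Simplified model of Delta's copy-on-write update (illustration only).
# A "table" is a set of immutable files; the transaction log records
# which files each commit added and removed.

def update_row(files, log, file_id, key, new_value):
    """Update one record by rewriting the whole file that contains it."""
    old_rows = files[file_id]
    # Steps 1-2: read the old file and write a new file with the
    # updated record plus all other rows that shared the file.
    new_rows = [(k, new_value if k == key else v) for k, v in old_rows]
    new_file_id = f"{file_id}-rewritten"
    files[new_file_id] = new_rows
    # Step 3: mark the old file removed and the new file added.
    log.append({"remove": file_id, "add": new_file_id})
    return new_file_id

# One file holding three rows.
files = {"part-0": [("a", 1), ("b", 2), ("c", 3)]}
log = []

new_file = update_row(files, log, "part-0", "b", 20)

# Step 4: a stream that processes every *added* file re-emits all
# three rows, not just the updated one -- hence duplicates downstream.
print(files[new_file])  # [('a', 1), ('b', 20), ('c', 3)]
```

The point of the sketch is that the unit of change the stream sees is the file, not the row, which is why unchanged rows come through again.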

This is also mentioned in the docs.

ignoreChanges: re-process updates if files had to be rewritten in the source table due to a data-changing operation such as UPDATE, MERGE INTO, DELETE (within partitions), or OVERWRITE. Unchanged rows may still be emitted. Therefore your downstream consumers should be able to handle duplicates. ...

You'll have to decide if this is ok for your use case.
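If duplicates are acceptable at the stream level, a common pattern is to deduplicate downstream by a business key, keeping the latest version of each row. A pure-Python sketch of that idea (in Spark you would typically do this with `dropDuplicates` or a `MERGE` inside `foreachBatch`; the function and field names below are illustrative):

```python
def dedupe_latest(rows):
    """Keep only the most recent version of each key.

    `rows` is an iterable of (key, version, value) tuples, e.g. the
    union of original records and records re-emitted from a
    rewritten file.
    """
    latest = {}
    for key, version, value in rows:
        if key not in latest or version > latest[key][0]:
            latest[key] = (version, value)
    return {k: value for k, (_, value) in latest.items()}

# Row 'b' arrives twice: once from the original file and once from
# the rewritten file that ignoreChanges re-emitted.
incoming = [
    ("a", 1, "alpha"),
    ("b", 1, "beta"),
    ("b", 2, "beta-updated"),  # re-emitted, newer version
    ("c", 1, "gamma"),
]

print(dedupe_latest(incoming))
# {'a': 'alpha', 'b': 'beta-updated', 'c': 'gamma'}
```

Any monotonically increasing column (a timestamp, a commit version) works as the "version" here; the key design choice is that downstream logic must be idempotent with respect to re-delivered rows.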

If you need to handle updates and deletes specifically, Databricks offers Change Data Feed, which you can enable on Delta tables.

This gives you row-level details about inserts, updates, and deletes (at the cost of some extra storage and IO).
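Change Data Feed (enabled via the `delta.enableChangeDataFeed` table property) tags each row-level change with a `_change_type` column: `insert`, `update_preimage`, `update_postimage`, or `delete`. A pure-Python sketch of consuming such a feed (the feed rows below are illustrative, not real CDF output):

```python
# Apply a batch of Change Data Feed rows to a local dict keyed by id.
# _change_type values mirror Delta CDF: insert, update_preimage,
# update_postimage, delete.

def apply_changes(state, changes):
    for row in changes:
        change = row["_change_type"]
        if change in ("insert", "update_postimage"):
            state[row["id"]] = row["value"]
        elif change == "delete":
            state.pop(row["id"], None)
        # update_preimage (the row's old value) needs no action here.
    return state

feed = [
    {"id": 1, "value": "new", "_change_type": "insert"},
    {"id": 2, "value": "old", "_change_type": "update_preimage"},
    {"id": 2, "value": "updated", "_change_type": "update_postimage"},
    {"id": 3, "value": "gone", "_change_type": "delete"},
]

state = {3: "gone"}
print(apply_changes(state, feed))  # {1: 'new', 2: 'updated'}
```

Unlike `ignoreChanges`, this tells you *what* changed per row, so the consumer can react to updates and deletes instead of just tolerating duplicates.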

tom_shaffner
New Contributor III

@Kaniz Fatma​ , thanks, that helps. I was assuming this error indicated a schema evolution; based on what you say, it likely didn't, and I just have to turn on `ignoreChanges` whenever I stream from a table that receives updates/upserts.

To be clear though: in such a case, turning on `ignoreChanges` would not suppress errors if I ever do have a schema evolution, correct? I would still get notified of such a change? Or does turning on `ignoreChanges` mean I lose those alerts too?
