Data Engineering

"Detected a data update", what changed?

tom_shaffner
New Contributor III

In streaming flows I periodically get a "Detected a data update" error. This error generally seems to indicate that something has changed in the source table schema, but it's not immediately apparent what. In one case yesterday I pulled the source table and the destination table from the flow (which had worked the day prior) and couldn't see any differences in the schemas. It's possible I missed something and need to write a programmatic comparison, but needing to do so suggests a problem with the error message itself.

If I delete and recreate the tables and checkpoints, the error goes away, but I can only do that so easily because I'm still in dev. Is there a way, particularly in a case where "wipe and restart" isn't such a great option, to see what about the data update triggered the error? Or a way to make the error message mention what changed?

2 REPLIES

Kaniz
Community Manager

Hi @Tom Shaffner​ , a similar issue on Stack Overflow states:

The read stream will throw an exception if there are updates or deletes in your Delta source. This is also clear from the Databricks documentation:

Structured Streaming does not handle input that is not an append and throws an exception if any modifications occur on the table used as a source.

If you set the option `ignoreChanges` to `true`, it will not throw an exception, but it will give you the updated rows plus rows that may already have been processed.

This is because everything in a Delta table happens at the file level.

For example, if you update a single row in a file (roughly), the following will happen:

  1. Find and read the file which contains the record to be updated.
  2. Write a new file that contains the updated record plus all the other data that was also in the old file.
  3. Mark the old file as removed and the new file as added in the transaction log.
  4. Your read stream will read the whole new file as 'new' records. This means you can get duplicates in your stream.
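The copy-on-write behavior above can be illustrated with a small pure-Python sketch (a simplified model of the mechanism, not actual Delta code): updating one row rewrites its whole file, and a reader that only tracks added files then sees every row in the rewritten file as new.

```python
# Simplified model of Delta's copy-on-write update (illustration only).
# A "table" is a set of immutable files; the transaction log records
# which files each commit added and removed.

def update_row(files, log, file_id, key, new_value):
    """Update one record by rewriting the whole file that contains it."""
    old_rows = files[file_id]
    # Steps 1-2: read the old file and write a new file with the
    # updated record plus all other rows that shared the file.
    new_rows = [(k, new_value if k == key else v) for k, v in old_rows]
    new_file_id = f"{file_id}-rewritten"
    files[new_file_id] = new_rows
    # Step 3: mark the old file removed and the new file added.
    log.append({"remove": file_id, "add": new_file_id})
    return new_file_id

# One file holding three rows.
files = {"part-0": [("a", 1), ("b", 2), ("c", 3)]}
log = []

new_file = update_row(files, log, "part-0", "b", 20)

# Step 4: a stream that processes every *added* file re-emits all
# three rows, not just the updated one -- hence duplicates downstream.
print(files[new_file])  # [('a', 1), ('b', 20), ('c', 3)]
```

The point of the sketch is that the unit of change the stream sees is the file, not the row, which is why unchanged rows come through again.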

This is also mentioned in the docs.

ignoreChanges: re-process updates if files had to be rewritten in the source table due to a data-changing operation such as UPDATE, MERGE INTO, DELETE (within partitions), or OVERWRITE. Unchanged rows may still be emitted. Therefore your downstream consumers should be able to handle duplicates. ...

You'll have to decide if this is ok for your use case.
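If duplicates are acceptable at the stream level, a common pattern is to deduplicate downstream by a business key, keeping the latest version of each row. A pure-Python sketch of that idea (in Spark you would typically do this with `dropDuplicates` or a `MERGE` inside `foreachBatch`; the function and field names below are illustrative):

```python
def dedupe_latest(rows):
    """Keep only the most recent version of each key.

    `rows` is an iterable of (key, version, value) tuples, e.g. the
    union of original records and records re-emitted from a
    rewritten file.
    """
    latest = {}
    for key, version, value in rows:
        if key not in latest or version > latest[key][0]:
            latest[key] = (version, value)
    return {k: value for k, (_, value) in latest.items()}

# Row 'b' arrives twice: once from the original file and once from
# the rewritten file that ignoreChanges re-emitted.
incoming = [
    ("a", 1, "alpha"),
    ("b", 1, "beta"),
    ("b", 2, "beta-updated"),  # re-emitted, newer version
    ("c", 1, "gamma"),
]

print(dedupe_latest(incoming))
# {'a': 'alpha', 'b': 'beta-updated', 'c': 'gamma'}
```

Any monotonically increasing column (a timestamp, a commit version) works as the "version" here; the key design choice is that downstream logic must be idempotent with respect to re-delivered rows.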

If you need to handle updates and deletes specifically, Databricks offers Change Data Feed, which you can enable on Delta tables.

This gives you row-level details about inserts, updates, and deletes (at the cost of some extra storage and IO).
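Change Data Feed (enabled via the `delta.enableChangeDataFeed` table property) tags each row-level change with a `_change_type` column: `insert`, `update_preimage`, `update_postimage`, or `delete`. A pure-Python sketch of consuming such a feed (the feed rows below are illustrative, not real CDF output):

```python
# Apply a batch of Change Data Feed rows to a local dict keyed by id.
# _change_type values mirror Delta CDF: insert, update_preimage,
# update_postimage, delete.

def apply_changes(state, changes):
    for row in changes:
        change = row["_change_type"]
        if change in ("insert", "update_postimage"):
            state[row["id"]] = row["value"]
        elif change == "delete":
            state.pop(row["id"], None)
        # update_preimage (the row's old value) needs no action here.
    return state

feed = [
    {"id": 1, "value": "new", "_change_type": "insert"},
    {"id": 2, "value": "old", "_change_type": "update_preimage"},
    {"id": 2, "value": "updated", "_change_type": "update_postimage"},
    {"id": 3, "value": "gone", "_change_type": "delete"},
]

state = {3: "gone"}
print(apply_changes(state, feed))  # {1: 'new', 2: 'updated'}
```

Unlike `ignoreChanges`, this tells you *what* changed per row, so the consumer can react to updates and deletes instead of just tolerating duplicates.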

tom_shaffner
New Contributor III

@Kaniz Fatma​ , thanks, that helps. I was assuming this error indicated a schema evolution; based on what you say, it likely didn't, and I just have to turn on `ignoreChanges` whenever I stream from a table that receives updates/upserts.

To be clear though: in such a case, turning on `ignoreChanges` would not suppress errors if I ever do have a schema evolution, correct? I would still get notified of such a change? Or does turning on `ignoreChanges` mean I lose those alerts too?
