Hi @Tom Shaffner, a similar issue on Stack Overflow states:
The read stream will throw an exception if there are updates or deletes in your Delta source. This is also clear from the Databricks documentation:
Structured Streaming does not handle input that is not an append and throws an exception if any modifications occur on the table used as a source.
If you use the ignoreChanges option set to true, it will not throw an exception, but it will give you the updated rows plus rows that may already have been processed.
This is because everything in a Delta table happens at the file level.
For example, if you update a single row, roughly the following happens (see the sketch after this list):
- Find and read the file which contains the record to be updated.
- Write a new file that contains the updated record plus all the other data that was in the old file.
- Mark the old file as removed and the new file as added in the transaction log.
- Your read stream will read the whole new file as 'new' records. This means you can get duplicates in your stream.
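To see that file-level rewrite in practice, here is a minimal sketch (PySpark in a Databricks notebook, assuming a hypothetical Delta table named events with columns id and status; the exact metric keys in DESCRIBE HISTORY can vary by Delta Lake version):

```python
# Minimal sketch: the table `events` and its columns are placeholders.
# Update a single row, then inspect the latest commit's operation metrics.
spark.sql("UPDATE events SET status = 'done' WHERE id = 42")

latest_commit = spark.sql("DESCRIBE HISTORY events LIMIT 1").collect()[0]
metrics = latest_commit["operationMetrics"]

# Delta removed and added whole files, even though only one logical row changed.
print(metrics.get("numRemovedFiles"), metrics.get("numAddedFiles"))
```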
This is also mentioned in the docs.
ignoreChanges: re-process updates if files had to be rewritten in the source table due to a data changing operation such as UPDATE, MERGE INTO, DELETE (within partitions), or OVERWRITE. Unchanged rows may still be emitted, therefore your downstream consumers should be able to handle duplicates. ...
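As a rough example of what that looks like, here is a minimal sketch (PySpark in a Databricks notebook, so spark is already available), assuming a Delta source at /delta/events with a unique key column id and an event-time column event_time; all paths, column names, and the watermark interval are placeholders:

```python
# Minimal sketch: read the Delta source with ignoreChanges and
# deduplicate downstream, since rewritten files are re-emitted.
stream = (
    spark.readStream
         .format("delta")
         .option("ignoreChanges", "true")   # re-emits rewritten files instead of throwing
         .load("/delta/events")
)

# Downstream must tolerate duplicates, e.g. by deduplicating on the key
# within a watermark window.
deduped = (
    stream.withWatermark("event_time", "1 hour")
          .dropDuplicates(["id", "event_time"])
)

query = (
    deduped.writeStream
           .format("delta")
           .option("checkpointLocation", "/checkpoints/events_dedup")
           .outputMode("append")
           .start("/delta/events_clean")
)
```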
You'll have to decide if this is ok for your use case.
If you need to address updates and deletes specifically, Databricks offers Change Data Feed, which you can enable on Delta tables.
This gives you row-level details about inserts, updates, and deletes (at the cost of some extra storage and I/O).
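If that route fits your use case, a minimal sketch looks like this (again assuming a hypothetical table named events; the starting version is a placeholder):

```python
# Minimal sketch: enable Change Data Feed on an existing Delta table,
# then stream its row-level changes.
spark.sql("""
    ALTER TABLE events
    SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

# Each change row carries _change_type ('insert', 'update_preimage',
# 'update_postimage', 'delete') plus _commit_version and _commit_timestamp.
changes = (
    spark.readStream
         .format("delta")
         .option("readChangeFeed", "true")
         .option("startingVersion", 0)   # placeholder: use a version at or after CDF was enabled
         .table("events")
)
```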