Sorry, but I have to correct your statement here, because this is the foundation of the whole behavior!
The checkpoint doesn't store the target schema. It stores the source schema that was in effect when the stream last ran. The offset (yes, stored in the offsets folder) tracks the last committed version from the source Delta table, so on the next micro-batch it reads from version N+1 onward (that part you've got right).
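For illustration, here is a minimal sketch of what the checkpoint's offset tracks. The field names mirror Delta's streaming source offset format as I understand it, but treat the exact layout as an assumption and inspect your own checkpoint's offsets folder to confirm:

```python
import json

# Hypothetical contents of checkpoint/offsets/<batchId> for a Delta source.
# reservoirVersion is the source table version the stream has committed up to.
offset_json = """
{
  "sourceVersion": 1,
  "reservoirVersion": 42,
  "index": -1,
  "isStartingVersion": false
}
"""

offset = json.loads(offset_json)
last_committed = offset["reservoirVersion"]
# The next micro-batch resumes from the following source version.
next_read_from = last_committed + 1
print(next_read_from)  # 43
```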
The schema check happens on the read side, not the write side. When the stream advances through the Delta log from N+1 forward, it validates each transaction entry against the schema it has recorded. If it hits a non-additive change (drop, rename, type change) at any version in that range, it fails immediately before it even produces a DataFrame for that micro-batch. It never gets to the write stage.
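To make the read-side check concrete, here is a small self-contained sketch in plain Python (not Spark; the log contents and helper names are invented for illustration). It walks the log from N+1 forward and fails on the first non-additive change, before any micro-batch could be built:

```python
# Toy model of the read-side validation. Each log entry carries the schema
# that was in effect at that source table version.
delta_log = {
    41: ["id", "name"],
    42: ["id", "name"],
    43: ["id", "name", "email"],   # additive change: a new column appended
    44: ["id", "email"],           # non-additive change: "name" was dropped
}

def validate_range(tracked_schema, start_version, end_version):
    """Fail fast on the first non-additive change, mimicking the read side."""
    schema = list(tracked_schema)
    for v in range(start_version, end_version + 1):
        entry = delta_log[v]
        # Non-additive: any previously tracked column is missing.
        if not all(col in entry for col in schema):
            raise RuntimeError(
                f"DELTA_STREAMING_INCOMPATIBLE_SCHEMA_CHANGE_USE_LOG at version {v}"
            )
        schema = entry  # adopt additive changes and keep scanning
    return schema       # safe: a micro-batch could now be constructed

try:
    validate_range(["id", "name"], 43, 44)
except RuntimeError as e:
    print(e)  # fails at version 44, before any DataFrame or write happens
```

Note that the failure happens while scanning the log range, which is why no write-side setting can prevent it.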
The write-side schema comparison is a separate thing. That's where mergeSchema or schema evolution settings come in — it controls whether the target table accepts new columns or structural changes in the incoming DataFrame. But the error you're seeing (DELTA_STREAMING_INCOMPATIBLE_SCHEMA_CHANGE_USE_LOG) is purely a read-side failure. The stream can't even construct the DataFrame to write because it can't reconcile the source log history.
So the flow is:
- Stream wakes up, reads checkpoint (source schema + last source version offset)
- Reads Delta log from N+1 to current version
- If any entry has a non-additive schema change -> fails here, never reaches write
- If all entries are compatible -> constructs DataFrame with data from those versions
- At write time -> compares DataFrame schema with target table schema (this is where mergeSchema matters)
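The last step above (the write-side comparison, where mergeSchema matters) can be sketched the same way. Again, this is plain Python with invented names, just to show the decision logic, not Delta's actual implementation:

```python
def write_batch(target_schema, batch_schema, merge_schema=False):
    """Toy write-side check: does the target accept the incoming schema?"""
    new_cols = [c for c in batch_schema if c not in target_schema]
    if new_cols and not merge_schema:
        raise RuntimeError(f"Schema mismatch: target lacks columns {new_cols}")
    # With mergeSchema enabled, the target evolves to absorb the new columns.
    return target_schema + new_cols

# Without mergeSchema, the write is rejected...
try:
    write_batch(["id", "name"], ["id", "name", "email"])
except RuntimeError as e:
    print(e)

# ...with it, the target table's schema evolves.
print(write_batch(["id", "name"], ["id", "name", "email"], merge_schema=True))
```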
Hope this helps! And sorry for the lengthy explanation, but I feel that enabling you is more helpful than a one-shot answer, so that you can make the right call during your migration!