This issue is related to how Delta Lake's structured streaming interacts with schema evolution and options like startingVersion and schemaTrackingLocation. The behavior you've observed has been noted by other users, and can be subtle due to how checkpointing, versioning, and schema tracking are handled in combination. Here's a breakdown, with solutions:
Core Issue
Setting startingVersion as an option in your stream appears to interfere with schema evolution: the stream keeps using the old schema, even after the underlying Delta table's schema has changed and updates are written to your schemaTrackingLocation.
When you remove startingVersion, the DataStreamReader detects schema changes correctly, provided schema tracking is enabled. According to the Databricks documentation, startingVersion should only matter when initializing a new stream, not when resuming from an existing checkpoint.
Why Does This Happen?
- Schema Tracking and startingVersion:
  - When startingVersion is set, it can impact which version of the table the streaming query starts reading from, even if a checkpoint exists. Certain system versions and Spark releases may not fully disregard this option after checkpoint initialization due to nuanced implementation details behind the scenes.
  - The schema stored at schemaTrackingLocation is used for schema management, but if the stream is "stuck" at an older version due to how startingVersion is interpreted, it may not trigger schema updates.
- Checkpoints and Restart Behavior:
  - On restart, if a checkpoint exists, the stream should ignore startingVersion. However, if the checkpoint is missing or corrupted, or if the startingVersion option is reapplied incorrectly, the schema may not evolve as expected.
Suggested Solution
- Remove startingVersion After Initial Start:
  - Only use the startingVersion option when you first start the stream and no checkpoint exists.
  - After initial startup and successful checkpointing, remove startingVersion so schema tracking works properly on subsequent runs. Schema changes should then be detected and handled via your schemaTrackingLocation.
- Confirm Checkpoint Health:
  - Make sure your checkpoint directory is healthy and present when restarting the stream. If the checkpoint is not present, startingVersion will be used.
- Upgrade Databricks & Delta Lake:
  - Certain bugs with schema tracking and stream options have been resolved in later versions of Databricks and Delta Lake. Upgrading may resolve unexpected behaviors.
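The "only on first start" rule above can be automated so that startingVersion never has to be removed by hand. The following is a minimal sketch, not a definitive implementation. The helper names (checkpoint_initialized, delta_stream_options, read_delta_stream) are hypothetical, and it assumes the checkpoint directory is reachable as a local or mounted path and that a non-empty offsets subdirectory means the stream has already recorded progress. For cloud storage paths you would need to swap in your own listing call.

```python
import os
from typing import Dict, Optional


def checkpoint_initialized(checkpoint_dir: str) -> bool:
    """Return True if the checkpoint holds at least one file under offsets/.

    Assumption: checkpoint_dir is a local or mounted filesystem path.
    """
    offsets_dir = os.path.join(checkpoint_dir, "offsets")
    return os.path.isdir(offsets_dir) and len(os.listdir(offsets_dir)) > 0


def delta_stream_options(
    checkpoint_dir: str,
    schema_tracking_location: str,
    starting_version: Optional[int] = None,
) -> Dict[str, str]:
    """Build reader options, adding startingVersion only when no checkpoint exists."""
    options = {"schemaTrackingLocation": schema_tracking_location}
    if starting_version is not None and not checkpoint_initialized(checkpoint_dir):
        # First run only: later runs resume from the checkpoint instead.
        options["startingVersion"] = str(starting_version)
    return options


def read_delta_stream(spark, table_path: str, options: Dict[str, str]):
    """Open the Delta source with the computed options.

    `spark` is an existing SparkSession; nothing here starts a query.
    """
    return spark.readStream.format("delta").options(**options).load(table_path)
```

A call might look like read_delta_stream(spark, table_path, delta_stream_options(ckpt, schema_log, starting_version=5)), where ckpt is the same path you pass to writeStream.option("checkpointLocation", ...) and table_path, schema_log, and the version number are placeholders for your own values.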
Workaround (if you need to retain startingVersion logic):
- Start your stream without the startingVersion once the checkpoint is established, so ongoing runs see schema changes.
- For testing, you can clear out your checkpoint directory (careful: this resets your offsets and may replay data) then set startingVersion to reinitialize from that version, but be sure to understand the replay implications.
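For the test-only reset described above, a guarded helper can make accidental deletion less likely. This is a rough sketch under stated assumptions: reset_local_checkpoint is a hypothetical name, and it only handles a local or mounted path via the standard library. On Databricks with cloud storage, dbutils.fs.rm(checkpoint_dir, True) would be the analogous call. If your schemaTrackingLocation sits under the checkpoint directory, it is deleted along with everything else.

```python
import shutil
from pathlib import Path


def reset_local_checkpoint(checkpoint_dir: str, confirm: bool = False) -> bool:
    """Delete a local checkpoint directory so the next run starts as a new stream.

    Returns True if a directory was removed, False if none existed.
    """
    if not confirm:
        # Guard: deleting the checkpoint discards offsets and may replay data.
        raise ValueError("pass confirm=True to delete the checkpoint")
    path = Path(checkpoint_dir)
    if not path.is_dir():
        return False
    shutil.rmtree(path)
    return True
```

After a reset, the next start behaves like a brand-new stream, so startingVersion takes effect again and any data from that version onward is reprocessed.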
Summary
This is not fully "intended" behavior, but more a side effect of how options and checkpointing interact in specific tool versions. Removing startingVersion after initial setup, maintaining your checkpoint, and enabling schema tracking is the correct pattern for evolving schemas in Delta Lake structured streaming. If the problem persists after following this approach and upgrading, it may warrant a support ticket or GitHub issue.