10-11-2024 12:21 AM
Hey Community,
we have a streaming pipeline that starts with Auto Loader ingesting data into a bronze table; this data is then picked up by another streaming job that transforms it and writes it into a silver table.
Now there are some schema changes (renaming a column and adding a new column) on the bronze table that we also want to propagate to the silver table.
We added the "schemaTrackingLocation" option to the stream that reads the data from bronze so that it does not fail because of these non-additive schema changes, but the stream now also does not pick up the schema changes at all. The schema that I get from the readStream operation is still the same as before the schema change in bronze.
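For reference, the read from bronze roughly looks like this (a minimal sketch; the table name and checkpoint path are placeholders):

bronze_df = spark.readStream \
    .format("delta") \
    .option("schemaTrackingLocation", checkpoint_path) \
    .table("bronze_table")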
Does anyone know how best to handle such schema changes with Structured Streaming, without too much downtime of the stream and too much overhead?
Thank you!
10-13-2024 11:27 AM
Hi @Volker,
How are you doing today?
As per my understanding, try using the mergeSchema option in your streaming read operation to ensure the stream picks up schema changes from the bronze table automatically. This will help propagate the new schema changes, like column renames or additions, to the silver table. Additionally, make sure that schema evolution is enabled for both the bronze and silver tables so these changes can be accommodated without restarting the stream. If you're dealing with column renames, you may need to handle those manually to keep the downstream transformations consistent. Lastly, minimize downtime by using checkpointing, which ensures the stream continues where it left off after the schema changes are applied.
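As a sketch, one way to enable automatic schema merging for Delta writes session-wide is the autoMerge config (please double-check in the Delta docs that it applies to your runtime and write path):

# Let Delta writes merge additive schema changes without passing mergeSchema each time
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")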
Give it a try and let me know.
Regards,
Brahma
10-14-2024 12:32 AM
Hi @Brahmareddy,
thanks for your reply! Doesn't using mergeSchema still require a new checkpoint for the stream? In that case I could restart the stream with a new checkpoint and provide a starting version.
10-14-2024 01:57 AM
I feel like
.option("schemaTrackingLocation", checkpoint_path)
should do the trick, but if I add this option and alter the source table schema, the schema that the stream reads is still the old one from before the schema changes, and thus I cannot propagate the schema changes from bronze to silver.
Could it be that I'm doing something wrong or missing some additional option?
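For context, this is roughly how I read from bronze and check the schema (again a sketch; the table name and checkpoint path are placeholders):

df = spark.readStream \
    .format("delta") \
    .option("schemaTrackingLocation", checkpoint_path) \
    .table("bronze_table")
df.printSchema()  # still prints the pre-change columns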
10-14-2024 02:59 AM
@Volker Make sure that you also enable schema evolution in your write stream operation. You can do this by adding the mergeSchema option when writing to the silver table. This allows the destination table to automatically adapt to the new schema.
# Enable schema evolution on the write side so the silver table can adapt to new columns,
# and keep the existing checkpoint so the stream resumes where it left off.
silver_df.writeStream \
    .format("delta") \
    .option("mergeSchema", "true") \
    .option("checkpointLocation", checkpoint_path) \
    .outputMode("append") \
    .table("silver_table")
10-16-2024 04:19 AM
I adjusted the schema in both bronze and silver, such that I do not need schema evolution. The problem is that the DataStreamReader does not pick up the schema changes in bronze.
I already figured out that it has something to do with also providing a starting_version: I tested it with and without starting_version, and when I provided a starting version the stream did not pick up the schema changes, whereas if I omit the starting version the stream does pick up the changes.
Now I'm a little confused, since according to the documentation starting_version should only matter when the query is first started and should be ignored afterwards.
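To illustrate, these are roughly the two variants I compared (a sketch; the table name, checkpoint path, and version number are placeholders):

# With a starting version the stream keeps returning the old schema
df_with_version = spark.readStream \
    .format("delta") \
    .option("schemaTrackingLocation", checkpoint_path) \
    .option("startingVersion", "5") \
    .table("bronze_table")

# Without it, the schema changes are picked up
df_without_version = spark.readStream \
    .format("delta") \
    .option("schemaTrackingLocation", checkpoint_path) \
    .table("bronze_table")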