Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to stream the change data feed from a Delta table when its schema is changed?

raghav99
New Contributor II

Hi Team,
I would like to know how we can continue streaming the change data feed from a Delta table when its schema is changed (non-additive schema changes such as dropping or renaming a column, or a schema migration).

I came across schemaTrackingLocation in readStream, but I am not sure how to use it to continue streaming the change feed. I am using Delta Lake 3.0 and Databricks Runtime 14.0.
Thank you in advance.

 


4 REPLIES

Kaniz_Fatma
Community Manager

Hi @raghav99, Certainly! Streaming the change data feed from a Delta table with schema changes can be achieved using the schemaTrackingLocation option.

 

Let’s dive into the details:

 

Delta Table as a Source:

  • When you load a Delta table as a stream source and use it in a streaming query, the query processes all existing data in the table as well as any new data arriving after the stream starts.
  • You can load a Delta table as a stream source using either the table name or the file path, as shown in the example below.
  • Note that if the schema of a Delta table changes after a streaming read begins, the query fails. For most schema changes, you can restart the stream to resolve schema mismatches and continue processing.
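For reference, here is a minimal sketch of both read patterns; the table name and path are placeholders, and `spark` is the SparkSession that Databricks notebooks provide by default:

```python
# Load a Delta table as a streaming source by table name (placeholder name)
df_by_name = spark.readStream.table("my_catalog.my_schema.source_table")

# ...or by storage path (placeholder path)
df_by_path = spark.readStream.format("delta").load("/mnt/data/source_table")
```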

Limiting Input Rate:

  • You can control micro-batches using the following options (a short configuration sketch follows this list):
    • maxFilesPerTrigger: Determines how many new files are considered in each micro-batch (default is 1000).
    • maxBytesPerTrigger: Sets a “soft max” for data processed in each micro-batch. It may process more than the limit to keep the streaming query moving forward.
    • If you use maxBytesPerTrigger along with maxFilesPerTrigger, the micro-batch processes data until either limit is reached.
    • If the source table transactions are cleaned up due to log retention settings, the stream processes data corresponding to the latest available transaction history but avoids failing the stream.
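As an illustration, these options can be set on the stream reader; the values and table name below are placeholders, not recommendations:

```python
rate_limited_stream = (
    spark.readStream
        .format("delta")
        .option("maxFilesPerTrigger", 500)    # files considered per micro-batch
        .option("maxBytesPerTrigger", "1g")   # soft cap on bytes per micro-batch
        .table("my_catalog.my_schema.source_table")  # placeholder table name
)
```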

Handling Schema Changes:

  • To address non-additive schema changes (e.g., renaming or dropping columns), you can provide a schemaTrackingLocation (see the sketch after this list).
  • Each streaming read against a data source must have its own schemaTrackingLocation specified.
  • This overcomes issues where schema changes could break streams.
  • Set the delta.enableChangeDataFeed configuration to true to enable the change data feed feature.
  • Use the table_changes() function to view the changes in the Delta table.
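Putting those points together, here is a hedged sketch of a change data feed read with schema tracking. The table name and paths are placeholders, and the schema tracking location should be a directory dedicated to this one stream (commonly placed under its checkpoint location):

```python
# Enable the change data feed on the source table (placeholder table name)
spark.sql("""
    ALTER TABLE my_catalog.my_schema.source_table
    SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

# Stream the change data feed while tracking schema changes for this stream
cdf_stream = (
    spark.readStream
        .format("delta")
        .option("readChangeFeed", "true")
        .option("schemaTrackingLocation", "/mnt/checkpoints/cdf_stream/_schema_log")  # placeholder path
        .table("my_catalog.my_schema.source_table")
)

# For ad-hoc batch inspection of the feed, the table_changes() function works in SQL
spark.sql(
    "SELECT * FROM table_changes('my_catalog.my_schema.source_table', 0)"
).show()
```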

Streaming from Delta tables with schema changes is powerful, and with the right configuration, you can handle updates, inserts, and deletes seamlessly. 🚀

Thank you for the response, @Kaniz_Fatma.

So let's consider one or many schema changes per table; they can be combinations of dropping a column, renaming a column, adding a new column, updating a column's data type, and so on.
I am using schemaTrackingLocation and have observed the stream breaking with a StreamingQueryException.
It updates the schema tracking log, but since there are multiple schema changes, I have to rerun the stream several times to keep updating the schema tracking, presumably because many schema updates happened on the table while performing these migration combinations.

I am not sure how many times I need to re-trigger the stream to get the final updated schema into my schemaTrackingLocation.
So is this expected behaviour or a bug?

I was under the impression from the docs that with schemaTrackingLocation enabled, the stream doesn't fail and keeps picking up the right schema from the source table.

Hi @raghav99,

When dealing with schema changes in Delta Lake tables, especially in a streaming context, it’s essential to understand the behaviour and implications.

Let’s address your observations and concerns:

  1. Schema Tracking Location:

    • Providing a schemaTrackingLocation lets the streaming read record the source table's schema as it evolves, so the stream can continue after non-additive changes instead of failing permanently; each streaming read must use its own dedicated location.
  2. Behavior with Schema Evolution:

    • By default, Delta Lake supports schema evolution, which means that it can handle schema changes without breaking existing data.
    • When new columns are added or existing columns are updated, Delta Lake can merge the new schema with the existing one.
    • However, certain changes (like renaming columns) may require additional considerations.
  3. Column Mapping and Schema Changes:

    • When you enable column mapping for a Delta table, it upgrades the table version and allows metadata-only changes (e.g., renaming or marking columns as deleted) without rewriting data files.
    • However, multiple schema changes can be challenging to handle in a streaming context.
    • The behavior you’re observing—having to rerun the stream multiple times—is expected when dealing with complex schema changes.
  4. Streaming Behavior:

    • While schemaTrackingLocation helps track schema changes, it doesn’t guarantee that the stream won’t fail during schema updates.
    • Streams may need to be restarted to pick up the latest schema (see the restart sketch at the end of this reply).
    • The frequency of stream restarts depends on the rate of schema changes and the complexity of those changes.
  5. Optimizing Schema Changes:

    • Consider the following strategies:
      • Batch Schema Changes: Instead of applying many schema changes at once, batch them if possible. Apply changes incrementally to minimize disruptions.
      • Monitor Schema Changes: Keep an eye on schema changes and adjust your stream restart strategy accordingly.
      • Schema Evolution Rules: Understand the rules for schema evolution in Delta Lake (e.g., adding columns, changing data types).
      • Testing: Test schema changes in a non-production environment before applying them to production streams.
  6. Final Thoughts:

    • While Delta Lake provides robust schema management, handling complex schema changes in a streaming context can be challenging.
    • Regularly review the documentation and stay informed about any updates or improvements related to schema evolution and streaming behavior.

Remember that schema management is critical for maintaining data consistency and reliability.
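To make points 4 and 5 concrete, here is a minimal, hedged sketch of a restart loop around a change data feed stream. The table names, paths, restart limit, and availableNow trigger are placeholder assumptions for illustration, not a recommended production pattern; depending on the runtime, unblocking renames or drops may also require the explicit configuration named in the error message the failed stream prints.

```python
from pyspark.errors import StreamingQueryException  # on older runtimes: pyspark.sql.utils

SOURCE_TABLE = "my_catalog.my_schema.source_table"   # placeholder
TARGET_TABLE = "my_catalog.my_schema.target_table"   # placeholder
CHECKPOINT = "/mnt/checkpoints/cdf_stream"            # placeholder
MAX_RESTARTS = 10                                     # arbitrary safety limit

def start_cdf_stream():
    """Start a change data feed stream that drains available data, then stops."""
    return (
        spark.readStream
            .format("delta")
            .option("readChangeFeed", "true")
            .option("schemaTrackingLocation", f"{CHECKPOINT}/_schema_log")
            .table(SOURCE_TABLE)
            .writeStream
            .format("delta")
            .option("checkpointLocation", CHECKPOINT)
            .trigger(availableNow=True)
            .toTable(TARGET_TABLE)
    )

# Each non-additive schema change is recorded in the schema log and surfaces as a
# stream failure; restarting lets the stream move on to the next schema version.
for attempt in range(MAX_RESTARTS):
    try:
        start_cdf_stream().awaitTermination()
        break  # drained all available changes without a schema error
    except StreamingQueryException as err:
        print(f"Stream stopped on attempt {attempt + 1}: {err}")
```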

Kaniz_Fatma
Community Manager

I want to express my gratitude for your effort in selecting the most suitable solution. It's great to hear that your query has been successfully resolved. Thank you for your contribution.




 
