Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

DLT with CDC and schema changes in streaming pipelines

GarciaJorge
Visitor

Hi everyone,

I’m dealing with a scenario combining Delta Live Tables, CDC ingestion, and streaming pipelines, and I’ve hit a challenge that I haven’t seen clearly addressed in the docs.

Some Context:

  • Source is an upstream system emitting CDC events (insert/update/delete)
  • Data is ingested via Auto Loader into a bronze layer
  • From there, I’m using DLT to build silver tables with merge logic (SCD Type 1)
  • The pipeline runs in continuous/streaming mode
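For concreteness, the setup above might look roughly like the following DLT pipeline definition. This is only a sketch and only runs inside a Databricks DLT pipeline; the table names, the landing path, the `order_id` key, and the CDC columns `op` and `event_ts` are assumptions, not taken from my actual pipeline:

```python
import dlt
from pyspark.sql.functions import col

# Bronze: raw CDC events via Auto Loader, schema evolution allowed.
@dlt.table(name="bronze_orders")
def bronze_orders():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        # "addNewColumns" evolves the schema when new fields appear;
        # data that does not fit is captured in _rescued_data.
        .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
        .load("/landing/orders/")  # hypothetical landing path
    )

# Silver: SCD Type 1 upsert driven by the CDC metadata columns.
dlt.create_streaming_table("silver_orders")

dlt.apply_changes(
    target="silver_orders",
    source="bronze_orders",
    keys=["order_id"],                  # hypothetical business key
    sequence_by=col("event_ts"),        # hypothetical CDC ordering column
    apply_as_deletes=col("op") == "D",  # hypothetical delete marker
    stored_as_scd_type=1,
)
```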

The issue is around schema evolution, especially breaking changes:

  • column type changes (e.g., int → string)
  • column drops or renames
  • nested structure changes

While Auto Loader can handle schema evolution to some extent, downstream DLT transformations (especially merges) tend to fail or behave unpredictably when these changes occur.

My concerns:

  • avoiding pipeline failures in production
  • maintaining data quality and historical consistency
  • not overcomplicating the pipeline with excessive manual handling

Questions:

  • What’s the best pattern to handle breaking schema changes in this setup?
  • Do you isolate schema evolution strictly in bronze and enforce contracts from silver onward?
  • Has anyone implemented schema versioning or schema registry-like patterns with DLT?
  • How do you balance flexibility (auto evolution) vs governance (strict schemas)?

Would really appreciate insights from anyone who has dealt with this in production.

Thanks!

1 ACCEPTED SOLUTION

edonaire
New Contributor

In my opinion, the most reliable approach is to separate flexibility and control across layers.

First, allow schema evolution only in the bronze layer. This layer should be treated as raw and flexible, where Auto Loader can adapt to upstream changes.

Second, enforce a strict schema from the silver layer onward. This prevents instability in merge operations and downstream transformations.

A pattern that works well:

  1. Bronze: ingest raw data with schema evolution enabled
  2. Intermediate step: normalize the schema by casting types and handling missing or new columns
  3. Silver: apply merge logic using a stable and controlled schema
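Step 2 is essentially a projection onto a declared contract: every record is forced into a fixed set of columns and types before the merge ever sees it. A minimal plain-Python sketch of the idea (in a real pipeline this would be a `select` with `cast` expressions over the bronze stream; the contract and the record fields here are made up for illustration):

```python
# Declared silver contract: column name -> casting function.
# Columns not in the contract are dropped; missing columns become None.
CONTRACT = {
    "order_id": str,    # upstream changed int -> string; we pin it to str
    "amount": float,
    "status": str,
}

def normalize(record: dict) -> dict:
    """Project a raw bronze record onto the silver contract."""
    out = {}
    for column, cast in CONTRACT.items():
        value = record.get(column)  # missing column -> None
        out[column] = cast(value) if value is not None else None
    return out

raw = {"order_id": 42, "amount": "19.90", "status": "open", "new_col": "x"}
print(normalize(raw))  # {'order_id': '42', 'amount': 19.9, 'status': 'open'}
```

The key property is that the silver merge only ever sees the contract's columns and types, no matter what bronze evolves into.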

For type changes, it is safer to handle them explicitly instead of relying on automatic evolution. Implicit changes can lead to failed merges or inconsistent data.

For reprocessing, having the full raw data in bronze is critical. When a breaking change happens, you can update your transformation logic and replay the data without depending on the source system again.

In production, I also recommend adding monitoring to detect schema changes early instead of trying to fully automate recovery.
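The monitoring piece can be as simple as diffing the observed schema against the expected contract on each run and alerting on any difference. A small sketch of that check (in practice the observed schema would come from the Delta table or the DLT event log; the column names and type strings here are assumptions):

```python
# Expected silver contract: column name -> type name.
EXPECTED = {"order_id": "string", "amount": "double", "status": "string"}

def schema_drift(observed: dict) -> dict:
    """Diff an observed schema against the expected contract.

    Returns added, removed, and retyped columns so they can be
    alerted on, instead of silently auto-evolving downstream tables.
    """
    return {
        "added": sorted(set(observed) - set(EXPECTED)),
        "removed": sorted(set(EXPECTED) - set(observed)),
        "retyped": sorted(
            c for c in set(observed) & set(EXPECTED)
            if observed[c] != EXPECTED[c]
        ),
    }

observed = {"order_id": "int", "amount": "double", "customer": "string"}
print(schema_drift(observed))
# {'added': ['customer'], 'removed': ['status'], 'retyped': ['order_id']}
```

Any non-empty result is a signal to pause and decide deliberately how to handle the change, rather than letting evolution ripple into silver.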

In summary:

  • keep bronze flexible
  • enforce contracts in silver
  • handle breaking changes explicitly
  • design for reprocessing


2 Replies

Thanks, this is very helpful.
The idea of introducing a normalization layer before merges is interesting. I had not considered that as a separate step.

Have you seen any performance impact when adding this extra layer in DLT pipelines at scale?