Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

DLT with CDC and schema changes in streaming pipelines

GarciaJorge
New Contributor

Hi everyone,

I’m dealing with a scenario combining Delta Live Tables, CDC ingestion, and streaming pipelines, and I’ve hit a challenge that I haven’t seen clearly addressed in the docs.

Some context:

  • Source is an upstream system emitting CDC events (insert/update/delete)
  • Data is ingested via Auto Loader into a bronze layer
  • From there, I’m using DLT to build silver tables with merge logic (SCD Type 1)
  • The pipeline runs in continuous/streaming mode
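For reference, the shape I'm describing is roughly this (all table, column, and path names below are placeholders, not my real ones):

```python
import dlt
from pyspark.sql.functions import col, expr

# Bronze: raw CDC events ingested via Auto Loader, schema evolution left on
@dlt.table
def bronze_events():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "/mnt/schemas/events")  # placeholder
        .load("/mnt/landing/events")                                 # placeholder
    )

# Silver: SCD Type 1 target maintained from the CDC feed
dlt.create_streaming_table("silver_customers")

dlt.apply_changes(
    target="silver_customers",
    source="bronze_events",
    keys=["customer_id"],                    # placeholder key column
    sequence_by=col("event_ts"),             # placeholder ordering column
    apply_as_deletes=expr("op = 'DELETE'"),  # assumed CDC operation marker
    stored_as_scd_type=1,
)
```

This only runs inside a DLT pipeline, so treat it as a sketch of the setup rather than a working snippet.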

The issue is around schema evolution, especially breaking changes:

  • column type changes (e.g., int → string)
  • column drops or renames
  • nested structure changes

While Auto Loader can handle schema evolution to some extent, downstream DLT transformations (especially merges) tend to fail or behave unpredictably when these changes occur.
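For what it's worth, how far Auto Loader gets depends on which evolution mode the stream runs in; this is roughly how I understand the knobs (paths are placeholders):

```python
# Auto Loader schema evolution modes (summarizing my understanding):
#   addNewColumns    - default; new columns are added and the stream restarts once
#   rescue           - schema is frozen; mismatched data lands in _rescued_data
#   failOnNewColumns - the stream fails until the schema is updated manually
# Note: type changes like int -> string are not evolved in place; non-matching
# values end up in the _rescued_data column instead.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/schemas/events")  # placeholder
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
    .load("/mnt/landing/events")                                 # placeholder
)
```

So additions are survivable, but the breaking changes above are exactly where things get unpredictable downstream.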

My concerns:

  • avoiding pipeline failures in production
  • maintaining data quality and historical consistency
  • not overcomplicating the pipeline with excessive manual handling

Questions:

  • What’s the best pattern to handle breaking schema changes in this setup?
  • Do you isolate schema evolution strictly in bronze and enforce contracts from silver onward?
  • Has anyone implemented schema versioning or schema registry-like patterns with DLT?
  • How do you balance flexibility (auto evolution) vs governance (strict schemas)?

Would really appreciate insights from anyone who has dealt with this in production.

Thanks!

2 ACCEPTED SOLUTIONS

Accepted Solutions

edonaire
New Contributor

In my opinion, the most reliable approach is to separate flexibility and control across layers.

First, allow schema evolution only in the bronze layer. This layer should be treated as raw and flexible, where Auto Loader can adapt to upstream changes.

Second, enforce a strict schema from the silver layer onward. This prevents instability in merge operations and downstream transformations.

A pattern that works well:

  1. Bronze: ingest raw data with schema evolution enabled
  2. Intermediate step: normalize the schema by casting types and handling missing or new columns
  3. Silver: apply merge logic using a stable and controlled schema
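As a rough illustration of step 2 (table and column names here are made up), the normalization can be a thin view that casts everything to an explicit contract before the merge:

```python
import dlt
from pyspark.sql.functions import col, lit

# Explicit silver contract: column name -> target type (illustrative)
CONTRACT = {
    "customer_id": "string",
    "amount": "decimal(18,2)",
    "event_ts": "timestamp",
}

@dlt.view
def bronze_normalized():
    df = dlt.read_stream("bronze_events")  # placeholder source table
    projected = [
        (col(name) if name in df.columns else lit(None)).cast(dtype).alias(name)
        for name, dtype in CONTRACT.items()
    ]
    # Extra upstream columns are dropped here; missing ones become NULL
    return df.select(*projected)
```

The merge then reads from this view, so the silver schema never changes unless you change the contract deliberately.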

For type changes, it is safer to handle them explicitly instead of relying on automatic evolution. Implicit changes can lead to failed merges or inconsistent data.

For reprocessing, having the full raw data in bronze is critical. When a breaking change happens, you can update your transformation logic and replay the data without depending on the source system again.

In production, I also recommend adding monitoring to detect schema changes early instead of trying to fully automate recovery.
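One lightweight way to implement that monitoring is to diff each batch's schema against the last known one and alert on changes. A minimal, engine-agnostic sketch in plain Python (the column/type mappings stand in for whatever schema representation you capture):

```python
def diff_schemas(old, new):
    """Compare two {column: type} mappings and classify the changes."""
    added = {c: t for c, t in new.items() if c not in old}
    dropped = {c: t for c, t in old.items() if c not in new}
    retyped = {c: (old[c], new[c]) for c in old.keys() & new.keys() if old[c] != new[c]}
    return {"added": added, "dropped": dropped, "retyped": retyped}

def is_breaking(diff):
    # Drops and type changes are breaking; additions usually are not.
    return bool(diff["dropped"] or diff["retyped"])

if __name__ == "__main__":
    known = {"customer_id": "int", "name": "string"}
    incoming = {"customer_id": "string", "name": "string", "email": "string"}
    d = diff_schemas(known, incoming)
    print(d["retyped"])    # {'customer_id': ('int', 'string')}
    print(is_breaking(d))  # True
```

In practice you would persist the known schema and wire the breaking case into an alert, while non-breaking additions can simply be logged.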

In summary:

  • keep bronze flexible
  • enforce contracts in silver
  • handle breaking changes explicitly
  • design for reprocessing


edonaire
New Contributor

In practice, the impact of adding a normalization layer is usually small compared to the gains in stability and control.

At scale, the key is how you implement that layer. If it is designed to operate incrementally and aligned with your partitioning strategy, the overhead is minimal. You are only processing new or changed data, not reprocessing the full dataset.

A few things that help keep it efficient:

  • Keep transformations simple and column-focused; avoid heavy joins in this step
  • Align processing with partitions, for example by ingestion date or event date
  • Leverage incremental processing so only affected data is normalized
  • Avoid unnecessary shuffles by preserving data distribution when possible

In many cases, this layer actually improves overall performance indirectly, because it stabilizes schemas before merges. That reduces failed jobs, retries, and expensive recomputations.

Where you may see impact is if the normalization step becomes too complex or starts doing work that belongs in later layers. Keeping it focused on schema consistency is the key.

So overall, the trade-off is usually very favorable: a small additional cost for a significant gain in reliability.


3 REPLIES


GarciaJorge
New Contributor

Thanks, this is very helpful.
The idea of introducing a normalization layer before merges is interesting; I had not considered that as a separate step.

Have you seen any performance impact when adding this extra layer in DLT pipelines at scale?
