Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

DLT with CDC and schema changes in streaming pipelines

GarciaJorge
Visitor

Hi everyone,

I’m dealing with a scenario combining Delta Live Tables, CDC ingestion, and streaming pipelines, and I’ve hit a challenge that I haven’t seen clearly addressed in the docs.

Some Context:

  • Source is an upstream system emitting CDC events (insert/update/delete)
  • Data is ingested via Auto Loader into a bronze layer
  • From there, I’m using DLT to build silver tables with merge logic (SCD Type 1)
  • The pipeline runs in continuous/streaming mode
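For concreteness, the setup above might look roughly like the following DLT pipeline definition. This is only a sketch and only runs inside a Databricks DLT pipeline; the table names, the landing path, the `order_id` key, and the CDC columns `op` and `event_ts` are assumptions, not taken from my actual pipeline:

```python
import dlt
from pyspark.sql.functions import col

# Bronze: raw CDC events via Auto Loader, schema evolution allowed.
@dlt.table(name="bronze_orders")
def bronze_orders():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        # "addNewColumns" evolves the schema when new fields appear;
        # data that does not fit is captured in _rescued_data.
        .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
        .load("/landing/orders/")  # hypothetical landing path
    )

# Silver: SCD Type 1 upsert driven by the CDC metadata columns.
dlt.create_streaming_table("silver_orders")

dlt.apply_changes(
    target="silver_orders",
    source="bronze_orders",
    keys=["order_id"],                  # hypothetical business key
    sequence_by=col("event_ts"),        # hypothetical CDC ordering column
    apply_as_deletes=col("op") == "D",  # hypothetical delete marker
    stored_as_scd_type=1,
)
```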

The issue is around schema evolution, especially breaking changes:

  • column type changes (e.g., int → string)
  • column drops or renames
  • nested structure changes

While Auto Loader can handle schema evolution to some extent, downstream DLT transformations (especially merges) tend to fail or behave unpredictably when these changes occur.

My concerns:

  • avoiding pipeline failures in production
  • maintaining data quality and historical consistency
  • not overcomplicating the pipeline with excessive manual handling

Questions:

  • What’s the best pattern to handle breaking schema changes in this setup?
  • Do you isolate schema evolution strictly in bronze and enforce contracts from silver onward?
  • Has anyone implemented schema versioning or schema registry-like patterns with DLT?
  • How do you balance flexibility (auto evolution) vs governance (strict schemas)?

Would really appreciate insights from anyone who has dealt with this in production.

Thanks!

1 ACCEPTED SOLUTION

edonaire
New Contributor

In my opinion, the most reliable approach is to separate flexibility and control across layers.

First, allow schema evolution only in the bronze layer. This layer should be treated as raw and flexible, where Auto Loader can adapt to upstream changes.

Second, enforce a strict schema from the silver layer onward. This prevents instability in merge operations and downstream transformations.

A pattern that works well:

  1. Bronze: ingest raw data with schema evolution enabled
  2. Intermediate step: normalize the schema by casting types and handling missing or new columns
  3. Silver: apply merge logic using a stable and controlled schema
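Step 2 is essentially a projection onto a declared contract: every record is forced into a fixed set of columns and types before the merge ever sees it. A minimal plain-Python sketch of the idea (in a real pipeline this would be a `select` with `cast` expressions over the bronze stream; the contract and the record fields here are made up for illustration):

```python
# Declared silver contract: column name -> casting function.
# Columns not in the contract are dropped; missing columns become None.
CONTRACT = {
    "order_id": str,    # upstream changed int -> string; we pin it to str
    "amount": float,
    "status": str,
}

def normalize(record: dict) -> dict:
    """Project a raw bronze record onto the silver contract."""
    out = {}
    for column, cast in CONTRACT.items():
        value = record.get(column)  # missing column -> None
        out[column] = cast(value) if value is not None else None
    return out

raw = {"order_id": 42, "amount": "19.90", "status": "open", "new_col": "x"}
print(normalize(raw))  # {'order_id': '42', 'amount': 19.9, 'status': 'open'}
```

The key property is that the silver merge only ever sees the contract's columns and types, no matter what bronze evolves into.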

For type changes, it is safer to handle them explicitly instead of relying on automatic evolution. Implicit changes can lead to failed merges or inconsistent data.

For reprocessing, having the full raw data in bronze is critical. When a breaking change happens, you can update your transformation logic and replay the data without depending on the source system again.

In production, I also recommend adding monitoring to detect schema changes early instead of trying to fully automate recovery.
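The monitoring piece can be as simple as diffing the observed schema against the expected contract on each run and alerting on any difference. A small sketch of that check (in practice the observed schema would come from the Delta table or the DLT event log; the column names and type strings here are assumptions):

```python
# Expected silver contract: column name -> type name.
EXPECTED = {"order_id": "string", "amount": "double", "status": "string"}

def schema_drift(observed: dict) -> dict:
    """Diff an observed schema against the expected contract.

    Returns added, removed, and retyped columns so they can be
    alerted on, instead of silently auto-evolving downstream tables.
    """
    return {
        "added": sorted(set(observed) - set(EXPECTED)),
        "removed": sorted(set(EXPECTED) - set(observed)),
        "retyped": sorted(
            c for c in set(observed) & set(EXPECTED)
            if observed[c] != EXPECTED[c]
        ),
    }

observed = {"order_id": "int", "amount": "double", "customer": "string"}
print(schema_drift(observed))
# {'added': ['customer'], 'removed': ['status'], 'retyped': ['order_id']}
```

Any non-empty result is a signal to pause and decide deliberately how to handle the change, rather than letting evolution ripple into silver.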

In summary:

  • keep bronze flexible
  • enforce contracts in silver
  • handle breaking changes explicitly
  • design for reprocessing


2 Replies

Thanks, this is very helpful.
The idea of introducing a normalization layer before merges is interesting. I had not considered that as a separate step.

Have you seen any performance impact when adding this extra layer in DLT pipelines at scale?