yesterday
Hi everyone,
I’m dealing with a scenario combining Delta Live Tables, CDC ingestion, and streaming pipelines, and I’ve hit a challenge that I haven’t seen clearly addressed in the docs.
Some Context:
The core issue is schema evolution, especially breaking changes:
While Auto Loader can handle schema evolution to some extent, downstream DLT transformations (especially merges) tend to fail or behave unpredictably when these changes occur.
My concerns:
Questions:
Would really appreciate insights from anyone who has dealt with this in production.
Thanks!
yesterday - last edited yesterday
In my opinion, the most reliable approach is to separate flexibility and control across layers.
First, allow schema evolution only in the bronze layer. This layer should be treated as raw and flexible, where Auto Loader can adapt to upstream changes.
Second, enforce a strict schema from the silver layer onward. This prevents instability in merge operations and downstream transformations.
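To illustrate the strict-schema contract at silver, here is a minimal pure-Python sketch (in a real DLT pipeline this would be an explicit select/cast in the silver table definition; the column names are made up for the example):

```python
# Hypothetical silver contract: column name -> required type.
SILVER_SCHEMA = {"order_id": int, "amount": float, "status": str}

def enforce_silver_schema(record: dict) -> dict:
    """Project a raw bronze record onto the silver contract.

    Unknown columns (added upstream by schema evolution) are dropped,
    missing columns become None, and values are cast to the declared
    type so downstream merges always see a stable shape.
    """
    out = {}
    for col, typ in SILVER_SCHEMA.items():
        if col not in record or record[col] is None:
            out[col] = None
        else:
            out[col] = typ(record[col])  # fails loudly on incompatible values
    return out

# A bronze record that gained an extra column upstream:
raw = {"order_id": "42", "amount": "19.90", "status": "paid", "new_col": "x"}
clean = enforce_silver_schema(raw)
```

The key design point is that the contract is declared once, in code, so an upstream change can never silently widen the silver table.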
A pattern that works well:
For type changes, it is safer to handle them explicitly instead of relying on automatic evolution. Implicit changes can lead to failed merges or inconsistent data.
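One way to make a type change explicit is to cast with a quarantine path, so rows that cannot be converted are set aside for inspection instead of failing a merge. A pure-Python sketch of the idea (a real pipeline would do this with a cast plus a quarantine table or DLT expectation):

```python
def cast_or_quarantine(value, target_type, quarantine: list):
    """Explicitly cast a value to the new target type.

    Values that cannot be cast are collected for inspection rather
    than silently breaking a downstream merge.
    """
    try:
        return target_type(value)
    except (TypeError, ValueError):
        quarantine.append(value)
        return None

bad_rows = []
values = ["10", "11.5", "not-a-number"]
# Upstream changed the column from int to float; we cast explicitly:
cast = [cast_or_quarantine(v, float, bad_rows) for v in values]
```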
For reprocessing, having the full raw data in bronze is critical. When a breaking change happens, you can update your transformation logic and replay the data without depending on the source system again.
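The replay idea can be sketched in a few lines: because bronze keeps every record exactly as ingested, fixing a breaking change is just re-running the updated transformation over the raw history (illustrative data and function names below):

```python
# Bronze retains every raw record as ingested, so a fix is just a replay.
bronze = [
    {"id": 1, "amount": "10"},
    {"id": 2, "amount": "12.5"},   # the record that broke the old transform
]

def transform_v2(record: dict) -> dict:
    # v1 assumed integers; v2 handles the upstream type change explicitly.
    return {"id": record["id"], "amount": float(record["amount"])}

# Full reprocess from raw history, with no call back to the source system:
silver = [transform_v2(r) for r in bronze]
```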
In production, I also recommend adding monitoring to detect schema changes early instead of trying to fully automate recovery.
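A simple form of that monitoring is to diff the schema of each new batch against the last known schema and alert on any drift before a merge runs. A pure-Python sketch (in practice you might read the observed schema from the stream and the known one from a registry):

```python
def schema_diff(known: dict, observed: dict) -> dict:
    """Compare the last known schema against the schema of a new batch.

    Returns added/removed columns and type changes so an alert can fire
    before a merge fails, instead of trying to auto-recover afterwards.
    """
    return {
        "added": sorted(set(observed) - set(known)),
        "removed": sorted(set(known) - set(observed)),
        "retyped": sorted(
            c for c in set(known) & set(observed) if known[c] != observed[c]
        ),
    }

known = {"id": "int", "amount": "int"}
observed = {"id": "int", "amount": "double", "discount": "double"}
drift = schema_diff(known, observed)
# drift flags 'discount' as added and 'amount' as retyped
```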
In summary:
yesterday
Thanks, this is very helpful.
The idea of introducing a normalization layer before merges is interesting. I had not considered that as a separate step.
Have you seen any performance impact when adding this extra layer in DLT pipelines at scale?
yesterday
In practice, the impact of adding a normalization layer is usually small compared to the gains in stability and control.
At scale, the key is how you implement that layer. If it is designed to operate incrementally and aligned with your partitioning strategy, the overhead is minimal. You are only processing new or changed data, not reprocessing the full dataset.
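The incremental behavior can be sketched with a simple high-water mark: only records newer than the last processed point are normalized, which is why cost scales with new data rather than table size (in a streaming DLT pipeline the engine tracks this state for you; the field names here are illustrative):

```python
def normalize_increment(records, last_watermark):
    """Normalize only records newer than the last processed watermark."""
    new = [r for r in records if r["ts"] > last_watermark]
    normalized = [{"ts": r["ts"], "amount": float(r["amount"])} for r in new]
    watermark = max((r["ts"] for r in new), default=last_watermark)
    return normalized, watermark

history = [
    {"ts": 1, "amount": "5"},
    {"ts": 2, "amount": "6"},
    {"ts": 3, "amount": "7.5"},  # only this record arrived since ts=2
]
batch, wm = normalize_increment(history, last_watermark=2)
```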
A few things that help keep it efficient:
In many cases, this layer actually improves overall performance indirectly, because it stabilizes schemas before merges. That reduces failed jobs, retries, and expensive recomputations.
Where you may see impact is if the normalization step becomes too complex or starts doing work that belongs in later layers. Keeping it focused on schema consistency is the key.
So overall, the trade-off is usually very favorable: a small additional cost for a significant gain in reliability.