<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: DLT with CDC and schema changes in streaming pipelines in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/dlt-with-cdc-and-schema-changes-in-streaming-pipelines/m-p/152797#M53882</link>
    <description>&lt;P&gt;In my opinion, the most reliable approach is to separate flexibility and control across layers.&lt;/P&gt;&lt;P&gt;First, allow schema evolution only in the bronze layer. This layer should be treated as raw and flexible, where Auto Loader can adapt to upstream changes.&lt;/P&gt;&lt;P&gt;Second, enforce a strict schema from the silver layer onward. This prevents instability in merge operations and downstream transformations.&lt;/P&gt;&lt;P&gt;A pattern that works well:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Bronze: ingest raw data with schema evolution enabled&lt;/LI&gt;&lt;LI&gt;Intermediate step: normalize the schema by casting types and handling missing or new columns&lt;/LI&gt;&lt;LI&gt;Silver: apply merge logic using a stable and controlled schema&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;For type changes, it is safer to handle them explicitly instead of relying on automatic evolution. Implicit changes can lead to failed merges or inconsistent data.&lt;/P&gt;&lt;P&gt;For reprocessing, having the full raw data in bronze is critical. When a breaking change happens, you can update your transformation logic and replay the data without depending on the source system again.&lt;/P&gt;&lt;P&gt;In production, I also recommend adding monitoring to detect schema changes early instead of trying to fully automate recovery.&lt;/P&gt;&lt;P&gt;In summary:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;keep bronze flexible&lt;/LI&gt;&lt;LI&gt;enforce contracts in silver&lt;/LI&gt;&lt;LI&gt;handle breaking changes explicitly&lt;/LI&gt;&lt;LI&gt;design for reprocessing&lt;/LI&gt;&lt;/UL&gt;</description>
    <pubDate>Tue, 31 Mar 2026 19:57:00 GMT</pubDate>
    <dc:creator>edonaire</dc:creator>
    <dc:date>2026-03-31T19:57:00Z</dc:date>
    <item>
      <title>DLT with CDC and schema changes in streaming pipelines</title>
      <link>https://community.databricks.com/t5/data-engineering/dlt-with-cdc-and-schema-changes-in-streaming-pipelines/m-p/152796#M53881</link>
      <description>&lt;P class=""&gt;Hi everyone,&lt;/P&gt;&lt;P class=""&gt;I’m dealing with a scenario combining Delta Live Tables, CDC ingestion, and streaming pipelines, and I’ve hit a challenge that I haven’t seen clearly addressed in the docs.&lt;/P&gt;&lt;P class=""&gt;Some Context:&lt;/P&gt;&lt;UL class=""&gt;&lt;LI&gt;Source is an upstream system emitting CDC events (insert/update/delete)&lt;/LI&gt;&lt;LI&gt;Data is ingested via Auto Loader into a bronze layer&lt;/LI&gt;&lt;LI&gt;From there, I’m using DLT to build silver tables with merge logic (SCD Type 1)&lt;/LI&gt;&lt;LI&gt;The pipeline runs in continuous/streaming mode&lt;/LI&gt;&lt;/UL&gt;&lt;P class=""&gt;The issue is around &lt;STRONG&gt;schema evolution&lt;/STRONG&gt;, especially breaking changes:&lt;/P&gt;&lt;UL class=""&gt;&lt;LI&gt;column type changes (e.g., int → string)&lt;/LI&gt;&lt;LI&gt;column drops or renames&lt;/LI&gt;&lt;LI&gt;nested structure changes&lt;/LI&gt;&lt;/UL&gt;&lt;P class=""&gt;While Auto Loader can handle schema evolution to some extent, downstream DLT transformations (especially merges) tend to fail or behave unpredictably when these changes occur.&lt;/P&gt;&lt;P class=""&gt;My concerns:&lt;/P&gt;&lt;UL class=""&gt;&lt;LI&gt;avoiding pipeline failures in production&lt;/LI&gt;&lt;LI&gt;maintaining data quality and historical consistency&lt;/LI&gt;&lt;LI&gt;not overcomplicating the pipeline with excessive manual handling&lt;/LI&gt;&lt;/UL&gt;&lt;P class=""&gt;Questions:&lt;/P&gt;&lt;UL class=""&gt;&lt;LI&gt;What’s the best pattern to handle breaking schema changes in this setup?&lt;/LI&gt;&lt;LI&gt;Do you isolate schema evolution strictly in bronze and enforce contracts from silver onward?&lt;/LI&gt;&lt;LI&gt;Has anyone implemented schema versioning or schema registry-like patterns with DLT?&lt;/LI&gt;&lt;LI&gt;How do you balance flexibility (auto evolution) vs governance (strict schemas)?&lt;/LI&gt;&lt;/UL&gt;&lt;P class=""&gt;Would really appreciate insights from anyone who 
has dealt with this in production.&lt;/P&gt;&lt;P class=""&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Tue, 31 Mar 2026 19:40:26 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dlt-with-cdc-and-schema-changes-in-streaming-pipelines/m-p/152796#M53881</guid>
      <dc:creator>GarciaJorge</dc:creator>
      <dc:date>2026-03-31T19:40:26Z</dc:date>
    </item>
    <item>
      <title>Re: DLT with CDC and schema changes in streaming pipelines</title>
      <link>https://community.databricks.com/t5/data-engineering/dlt-with-cdc-and-schema-changes-in-streaming-pipelines/m-p/152797#M53882</link>
      <description>&lt;P&gt;In my opinion, the most reliable approach is to separate flexibility and control across layers.&lt;/P&gt;&lt;P&gt;First, allow schema evolution only in the bronze layer. This layer should be treated as raw and flexible, where Auto Loader can adapt to upstream changes.&lt;/P&gt;&lt;P&gt;Second, enforce a strict schema from the silver layer onward. This prevents instability in merge operations and downstream transformations.&lt;/P&gt;&lt;P&gt;A pattern that works well:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Bronze: ingest raw data with schema evolution enabled&lt;/LI&gt;&lt;LI&gt;Intermediate step: normalize the schema by casting types and handling missing or new columns&lt;/LI&gt;&lt;LI&gt;Silver: apply merge logic using a stable and controlled schema&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;For type changes, it is safer to handle them explicitly instead of relying on automatic evolution. Implicit changes can lead to failed merges or inconsistent data.&lt;/P&gt;&lt;P&gt;For reprocessing, having the full raw data in bronze is critical. When a breaking change happens, you can update your transformation logic and replay the data without depending on the source system again.&lt;/P&gt;&lt;P&gt;In production, I also recommend adding monitoring to detect schema changes early instead of trying to fully automate recovery.&lt;/P&gt;&lt;P&gt;In summary:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;keep bronze flexible&lt;/LI&gt;&lt;LI&gt;enforce contracts in silver&lt;/LI&gt;&lt;LI&gt;handle breaking changes explicitly&lt;/LI&gt;&lt;LI&gt;design for reprocessing&lt;/LI&gt;&lt;/UL&gt;</description>
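The three-step pattern above (flexible bronze, an explicit normalization step, strict silver) can be sketched as a small helper that coerces raw records to a stable contract before any merge logic runs. This is a simplified, framework-free illustration; the target schema, column names, and the `normalize` function are hypothetical, not from the thread, and a real pipeline would express the same idea as casts in a DLT/Spark transformation.

```python
# Sketch of the normalization step between bronze and silver:
# cast known columns to the silver contract's types, fill columns the
# source stopped sending with None, and drop columns the contract does
# not know about. TARGET_SCHEMA and the column names are illustrative.

TARGET_SCHEMA = {
    "id": int,        # silver contract: id is always an int
    "name": str,
    "amount": float,
}

def normalize(record: dict) -> dict:
    """Coerce one raw record to the stable silver contract."""
    out = {}
    for col, caster in TARGET_SCHEMA.items():
        if col in record and record[col] is not None:
            out[col] = caster(record[col])  # explicit cast, e.g. "42" -> 42
        else:
            out[col] = None                 # column missing upstream
    return out                              # unknown upstream columns dropped

# An upstream type change (id arriving as a string) and a new unexpected
# column are both absorbed here instead of breaking the silver merge:
row = normalize({"id": "42", "name": "alice", "extra": True})
```

The point of the sketch is that type changes are handled by an explicit, reviewable cast in one place, so the merge downstream only ever sees the contract.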
      <pubDate>Tue, 31 Mar 2026 19:57:00 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dlt-with-cdc-and-schema-changes-in-streaming-pipelines/m-p/152797#M53882</guid>
      <dc:creator>edonaire</dc:creator>
      <dc:date>2026-03-31T19:57:00Z</dc:date>
    </item>
    <item>
      <title>Re: DLT with CDC and schema changes in streaming pipelines</title>
      <link>https://community.databricks.com/t5/data-engineering/dlt-with-cdc-and-schema-changes-in-streaming-pipelines/m-p/152799#M53883</link>
      <description>&lt;P&gt;Thanks, this is very helpful.&lt;BR /&gt;The idea of introducing a normalization layer before merges is interesting. I had not considered that as a separate step.&lt;/P&gt;&lt;P&gt;Have you seen any performance impact when adding this extra layer in DLT pipelines at scale?&lt;/P&gt;</description>
      <pubDate>Tue, 31 Mar 2026 20:16:59 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dlt-with-cdc-and-schema-changes-in-streaming-pipelines/m-p/152799#M53883</guid>
      <dc:creator>GarciaJorge</dc:creator>
      <dc:date>2026-03-31T20:16:59Z</dc:date>
    </item>
    <item>
      <title>Re: DLT with CDC and schema changes in streaming pipelines</title>
      <link>https://community.databricks.com/t5/data-engineering/dlt-with-cdc-and-schema-changes-in-streaming-pipelines/m-p/152840#M53885</link>
      <description>&lt;P&gt;In practice, the impact of adding a normalization layer is usually small compared to the gains in stability and control.&lt;/P&gt;&lt;P&gt;At scale, the key is how you implement that layer. If it is designed to operate incrementally and aligned with your partitioning strategy, the overhead is minimal. You are only processing new or changed data, not reprocessing the full dataset.&lt;/P&gt;&lt;P&gt;A few things that help keep it efficient:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Keep transformations simple and column-focused, avoid heavy joins in this step&lt;/LI&gt;&lt;LI&gt;Align processing with partitions, for example by ingestion date or event date&lt;/LI&gt;&lt;LI&gt;Leverage incremental processing so only affected data is normalized&lt;/LI&gt;&lt;LI&gt;Avoid unnecessary shuffles by preserving data distribution when possible&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;In many cases, this layer actually improves overall performance indirectly, because it stabilizes schemas before merges. That reduces failed jobs, retries, and expensive recomputations.&lt;/P&gt;&lt;P&gt;Where you may see an impact is when the normalization step becomes too complex or starts doing work that belongs in later layers. Keeping it focused on schema consistency is the key.&lt;/P&gt;&lt;P&gt;So overall, the trade-off is usually very favorable: a small additional cost for a significant gain in reliability.&lt;/P&gt;</description>
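The earlier recommendation in this thread to monitor for schema changes rather than fully automate recovery can be sketched as a small drift check run against each batch: compare the observed columns and types to the silver contract and report additions, removals, and type changes. The `EXPECTED` contract, column names, and `detect_drift` function are illustrative assumptions, not an existing API.

```python
# Sketch of schema-drift monitoring: diff the schema observed in a batch
# against the expected silver contract. EXPECTED and the example column
# names are hypothetical; a real pipeline would read them from the table.

EXPECTED = {"id": "int", "name": "string", "amount": "double"}

def detect_drift(observed: dict) -> dict:
    """Return added, removed, and type-changed columns vs. EXPECTED."""
    added = sorted(set(observed) - set(EXPECTED))
    removed = sorted(set(EXPECTED) - set(observed))
    changed = sorted(
        col for col in set(EXPECTED) & set(observed)
        if observed[col] != EXPECTED[col]
    )
    return {"added": added, "removed": removed, "changed": changed}

# Example: upstream renamed "name" to "full_name" and widened "id" to string.
drift = detect_drift({"id": "string", "full_name": "string", "amount": "double"})
```

Wiring a report like this into an alert lets you catch a breaking change when it first appears in bronze, before it reaches the silver merge.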
      <pubDate>Wed, 01 Apr 2026 01:10:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dlt-with-cdc-and-schema-changes-in-streaming-pipelines/m-p/152840#M53885</guid>
      <dc:creator>edonaire</dc:creator>
      <dc:date>2026-04-01T01:10:45Z</dc:date>
    </item>
  </channel>
</rss>

