<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Architecture Advice: DLT Strategy for Daily Snapshots to SCD2 with &amp;quot;Grace Period&amp;quot; Dele in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/architecture-advice-dlt-strategy-for-daily-snapshots-to-scd2/m-p/149280#M53058</link>
    <description>&lt;DIV style="font-family: sans-serif; line-height: 1.6; color: #333;"&gt;
&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/216744"&gt;@samuelperezh&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Pattern for Grace Period&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;The &lt;CODE&gt;auto_cdc_from_snapshot&lt;/CODE&gt; &lt;CODE&gt;SCD2&lt;/CODE&gt; should be able to naturally take care of your scenario. Let's walk through an example:&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;1. Initial load with 2 products on 25/02/2026:&lt;/STRONG&gt;&lt;/P&gt;
&lt;TABLE style="width: 100%; border-collapse: collapse; table-layout: fixed; text-align: left; margin-bottom: 20px;" border="1"&gt;
&lt;THEAD&gt;
&lt;TR style="background-color: #f2f2f2; font-size: 1.1em;"&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;product_id&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;name&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;price&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;/TR&gt;
&lt;/THEAD&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD style="padding: 8px;"&gt;1&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;foo&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;10&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD style="padding: 8px;"&gt;2&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;bar&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;20&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD style="padding: 8px;"&gt;3&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;baz&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;30&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;P&gt;The target of the &lt;CODE&gt;auto_cdc_from_snapshot&lt;/CODE&gt; &lt;CODE&gt;SCD2&lt;/CODE&gt; will be:&lt;/P&gt;
&lt;TABLE style="width: 100%; border-collapse: collapse; table-layout: fixed; text-align: left; margin-bottom: 20px;" border="1"&gt;
&lt;THEAD&gt;
&lt;TR style="background-color: #f2f2f2; font-size: 1.1em;"&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;product_id&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;name&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;price&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;__START_AT&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;__END_AT&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;/TR&gt;
&lt;/THEAD&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD style="padding: 8px;"&gt;1&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;foo&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;10&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;25/02/2026&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;NULL&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD style="padding: 8px;"&gt;2&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;bar&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;20&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;25/02/2026&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;NULL&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD style="padding: 8px;"&gt;3&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;baz&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;30&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;25/02/2026&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;NULL&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;P&gt;&lt;STRONG&gt;2. Assume that &lt;CODE&gt;bar&lt;/CODE&gt; and &lt;CODE&gt;baz&lt;/CODE&gt; disappear on the load of 26/02/2026:&lt;/STRONG&gt;&lt;/P&gt;
&lt;TABLE style="width: 100%; border-collapse: collapse; table-layout: fixed; text-align: left; margin-bottom: 20px;" border="1"&gt;
&lt;THEAD&gt;
&lt;TR style="background-color: #f2f2f2; font-size: 1.1em;"&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;product_id&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;name&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;price&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;/TR&gt;
&lt;/THEAD&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD style="padding: 8px;"&gt;1&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;foo&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;10&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;P&gt;The target of &lt;CODE&gt;auto_cdc_from_snapshot&lt;/CODE&gt; will be updated to:&lt;/P&gt;
&lt;TABLE style="width: 100%; border-collapse: collapse; table-layout: fixed; text-align: left; margin-bottom: 20px;" border="1"&gt;
&lt;THEAD&gt;
&lt;TR style="background-color: #f2f2f2; font-size: 1.1em;"&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;product_id&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;name&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;price&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;__START_AT&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;__END_AT&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;/TR&gt;
&lt;/THEAD&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD style="padding: 8px;"&gt;1&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;foo&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;10&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;25/02/2026&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;NULL&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD style="padding: 8px;"&gt;2&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;bar&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;20&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;25/02/2026&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;26/02/2026&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD style="padding: 8px;"&gt;3&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;baz&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;30&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;25/02/2026&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;26/02/2026&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;P&gt;&lt;STRONG&gt;3. The &lt;CODE&gt;bar&lt;/CODE&gt; reappears the next day, on 27/02/2026:&lt;/STRONG&gt;&lt;/P&gt;
&lt;TABLE style="width: 100%; border-collapse: collapse; table-layout: fixed; text-align: left; margin-bottom: 20px;" border="1"&gt;
&lt;THEAD&gt;
&lt;TR style="background-color: #f2f2f2; font-size: 1.1em;"&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;product_id&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;name&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;price&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;/TR&gt;
&lt;/THEAD&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD style="padding: 8px;"&gt;1&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;foo&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;10&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD style="padding: 8px;"&gt;2&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;bar&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;20&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;P&gt;The target &lt;CODE&gt;CDC&lt;/CODE&gt; table now reflects the reappearance:&lt;/P&gt;
&lt;TABLE style="width: 100%; border-collapse: collapse; table-layout: fixed; text-align: left; margin-bottom: 20px;" border="1"&gt;
&lt;THEAD&gt;
&lt;TR style="background-color: #f2f2f2; font-size: 1.1em;"&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;product_id&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;name&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;price&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;__START_AT&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;__END_AT&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;/TR&gt;
&lt;/THEAD&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD style="padding: 8px;"&gt;1&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;foo&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;10&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;25/02/2026&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;NULL&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD style="padding: 8px;"&gt;2&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;bar&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;20&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;25/02/2026&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;26/02/2026&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD style="padding: 8px;"&gt;2&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;bar&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;20&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;27/02/2026&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;NULL&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD style="padding: 8px;"&gt;3&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;baz&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;30&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;25/02/2026&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;26/02/2026&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;P&gt;The wrongly deleted data will appear as a duplicate which needs to be handled in the downstream:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;If the new &lt;CODE&gt;__START_AT&lt;/CODE&gt; is earlier than &lt;CODE&gt;previous __END_AT + 3 days&lt;/CODE&gt;, we keep the record. Otherwise, it will be ignored.&lt;/LI&gt;
&lt;LI&gt;For each identical row, we retrieve only the &lt;CODE&gt;newest&lt;/CODE&gt; record.&lt;/LI&gt;
&lt;/UL&gt;
&lt;PRE style="background-color: #2d2d2d; color: #ccc; padding: 15px; border-radius: 5px; overflow-x: auto;"&gt;&lt;CODE&gt;CREATE MATERIALIZED VIEW target_auto_cdc_clean AS
WITH filtered AS (
  SELECT *,
    LEAD(__END_AT) OVER (PARTITION BY name, age ORDER BY __END_AT DESC NULLS FIRST) AS prev_end_at 
  FROM names_cdc
),
valid AS (
  SELECT *
  FROM filtered
  WHERE prev_end_at IS NULL 
     OR (
          prev_end_at IS NOT NULL 
          AND __START_AT &amp;lt;= prev_end_at + INTERVAL 3 DAYS 
        )
),
ranked AS (
  SELECT *, row_number() OVER (PARTITION BY name, age ORDER BY __START_AT DESC) AS rn 
  FROM valid
)
SELECT * FROM ranked WHERE rn = 1&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;This will return:&lt;/P&gt;
&lt;TABLE style="width: 100%; border-collapse: collapse; table-layout: fixed; text-align: left; margin-bottom: 20px;" border="1"&gt;
&lt;THEAD&gt;
&lt;TR style="background-color: #f2f2f2; font-size: 1.1em;"&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;product_id&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;name&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;price&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;__START_AT&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;__END_AT&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;/TR&gt;
&lt;/THEAD&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD style="padding: 8px;"&gt;1&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;foo&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;10&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;25/02/2026&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;NULL&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD style="padding: 8px;"&gt;2&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;bar&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;20&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;27/02/2026&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;NULL&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD style="padding: 8px;"&gt;3&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;baz&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;30&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;25/02/2026&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;26/02/2026&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;P&gt;Then, to apply the grace logic:&lt;/P&gt;
&lt;PRE style="background-color: #2d2d2d; color: #ccc; padding: 15px; border-radius: 5px; overflow-x: auto;"&gt;&lt;CODE&gt;SELECT *, 
    CASE 
        WHEN __END_AT &amp;gt; current_date() + INTERVAL 3 DAYS THEN TRUE
        ELSE FALSE 
    END as is_sold,
    CASE
        WHEN __END_AT &amp;gt; current_date() + INTERVAL 3 DAYS THEN __END_AT
        ELSE NULL END as sold_date
FROM target_auto_cdc_clean&lt;/CODE&gt;&lt;/PRE&gt;
&lt;/DIV&gt;
&lt;DIV&gt;Feel free to double check the queries to make sure they follow the business logic.&lt;/DIV&gt;
&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;DIV&gt;2. &lt;STRONG&gt;Backfill strategy.&amp;nbsp;&lt;/STRONG&gt;You can leverage the&amp;nbsp;&lt;SPAN&gt;&lt;CODE&gt;next_snapshot_and_version&lt;/CODE&gt; function to loop over the past dates to perform &lt;A href="https://docs.databricks.com/aws/en/ldp/cdc#example-historical-snapshot-processing" target="_self"&gt;historical backfilling&lt;/A&gt;. However, this would be an expensive operation if you have a lot of history. There is a tradeoff between the history depth in the SCD2 table vs time and cost spent to build it.&amp;nbsp;&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;DIV&gt;&lt;SPAN&gt;3. &lt;STRONG&gt;&lt;SPAN class=""&gt;Bronze Cleanup.&amp;nbsp;&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;SPAN class=""&gt;Be mindful that, if you delete the data from the bronze tables, you won't be able to recompute the whole history. For example, if you hit &lt;CODE&gt;Run pipeline with full table refresh&lt;/CODE&gt;, it will drop the AUTO CDC target table and start from scratch. If only last 7 days of data are present in the bronze layer, the new fully refreshed AUTO CDC target will contain only the last 7 days. You can m&lt;/SPAN&gt;&lt;SPAN class=""&gt;ove the old bronze data into a cold storage to reduce costs. Retrieve it whenever a full refresh is needed. This is again a tradeoff between cost and history depth. You can also prevent full table refresh at all by&amp;nbsp;setting &lt;CODE data-index-in-node="73" data-path-to-node="10,0"&gt;pipelines.reset.allowed = false&lt;/CODE&gt; in the&amp;nbsp;&lt;A href="https://docs.databricks.com/aws/en/ldp/properties#pipeline-table-properties" target="_self"&gt;pipeline&lt;/A&gt;.&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;DIV&gt;&lt;SPAN class=""&gt;Hope it helps.&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;DIV&gt;&lt;SPAN class=""&gt;Best regards,&lt;/SPAN&gt;&lt;/DIV&gt;</description>
    <pubDate>Wed, 25 Feb 2026 11:44:45 GMT</pubDate>
    <dc:creator>aleksandra_ch</dc:creator>
    <dc:date>2026-02-25T11:44:45Z</dc:date>
    <item>
      <title>Architecture Advice: DLT Strategy for Daily Snapshots to SCD2 with "Grace Period" Deletes</title>
      <link>https://community.databricks.com/t5/data-engineering/architecture-advice-dlt-strategy-for-daily-snapshots-to-scd2/m-p/148751#M52964</link>
      <description>&lt;P class=""&gt;&lt;SPAN class=""&gt;Hi everyone,&lt;/SPAN&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;SPAN class=""&gt;I’m looking for architectural advice on building a Silver layer in DLT. I am dealing with inventory data and need to handle historical tracking, "sold" logic based on disappearance, and storage cost optimization.&lt;/SPAN&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;SPAN class=""&gt;Here's how the situation looks like:&lt;/SPAN&gt;&lt;/P&gt;&lt;UL class=""&gt;&lt;LI&gt;&lt;P class=""&gt;&lt;STRONG&gt;&lt;SPAN class=""&gt;Source (Bronze):&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN class=""&gt; I receive a &lt;/SPAN&gt;&lt;STRONG&gt;&lt;SPAN class=""&gt;full snapshot&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN class=""&gt; of the inventory every day (~60GB/day).&lt;/SPAN&gt;&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P class=""&gt;&lt;STRONG&gt;&lt;SPAN class=""&gt;History:&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN class=""&gt; I have raw snapshots sitting in object storage starting from September 2025 (it's already at around 7TB).&lt;/SPAN&gt;&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P class=""&gt;&lt;STRONG&gt;&lt;SPAN class=""&gt;Goal (Silver):&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN class=""&gt; I need an &lt;/SPAN&gt;&lt;STRONG&gt;&lt;SPAN class=""&gt;SCD Type 2&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN class=""&gt; table to track changes in attributes (price, status) and determine when a product is sold.&lt;/SPAN&gt;&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P class=""&gt;These are the challenges I'm having:&lt;/P&gt;&lt;OL class=""&gt;&lt;LI&gt;&lt;P class=""&gt;&lt;STRONG&gt;&lt;SPAN class=""&gt;"Sold" Logic with Grace Period:&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;SPAN class=""&gt;Since the source is a full snapshot, a missing &lt;/SPAN&gt;&lt;SPAN class=""&gt;product id&lt;/SPAN&gt;&lt;SPAN class=""&gt; implies it was sold. However, data quality issues mean products sometimes disappear for 1-2 days and then reappear.&lt;/SPAN&gt;&lt;/P&gt;&lt;UL class=""&gt;&lt;LI&gt;&lt;P class=""&gt;&lt;SPAN class=""&gt;I need to implement a &lt;/SPAN&gt;&lt;STRONG&gt;&lt;SPAN class=""&gt;3-day grace period&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN class=""&gt;.&lt;/SPAN&gt;&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P class=""&gt;&lt;SPAN class=""&gt;If a product id is missing for &amp;gt;3 consecutive days, mark it as &lt;/SPAN&gt;&lt;SPAN class=""&gt;is_sold = True&lt;/SPAN&gt;&lt;SPAN class=""&gt; and set &lt;/SPAN&gt;&lt;SPAN class=""&gt;sold_date&lt;/SPAN&gt;&lt;SPAN class=""&gt; to the first day it went missing.&lt;/SPAN&gt;&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P class=""&gt;&lt;SPAN class=""&gt;If it reappears within 3 days, it should remain active (ignore the gap).&lt;/SPAN&gt;&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P class=""&gt;&lt;STRONG&gt;&lt;SPAN class=""&gt;Backfill + Incremental:&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;SPAN class=""&gt;I need to process the history (Sept 2025 to Today) sequentially to build the SCD2 history correctly, and then switch to daily incremental processing for new snapshots.&lt;/SPAN&gt;&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P class=""&gt;&lt;STRONG&gt;&lt;SPAN class=""&gt;Storage Costs:&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;SPAN class=""&gt;My cloud storage costs are high ($300+/mo) because I am retaining daily 60GB full snapshots in Bronze indefinitely. Once the data is processed into the SCD2 Silver layer, I want to aggressively clean up the Bronze &lt;/SPAN&gt;table/files and&lt;SPAN class=""&gt;&amp;nbsp;keep only 7 days.&lt;/SPAN&gt;&lt;/P&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;P class=""&gt;&lt;SPAN class=""&gt;I am looking at using &lt;/SPAN&gt;&lt;SPAN class=""&gt;dlt.create_auto_cdc_from_snapshot_flow&lt;/SPAN&gt;&lt;SPAN class=""&gt;. However, I'm unsure how to inject the &lt;/SPAN&gt;&lt;STRONG&gt;&lt;SPAN class=""&gt;Grace Period&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN class=""&gt; logic into the native CDC flow, as the native function tends to mark deletes immediately upon disappearance.&lt;/SPAN&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;STRONG&gt;&lt;SPAN class=""&gt;My Questions:&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;OL class=""&gt;&lt;LI&gt;&lt;P class=""&gt;&lt;STRONG&gt;&lt;SPAN class=""&gt;Pattern for Grace Period:&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN class=""&gt; Is it better to stick with &lt;/SPAN&gt;&lt;SPAN class=""&gt;create_auto_cdc_from_snapshot_flow&lt;/SPAN&gt;&lt;SPAN class=""&gt; and try to manipulate the snapshot frame before passing it to the CDC engine like carrying forward missing rows for 3 days? Or should I write a custom &lt;/SPAN&gt;&lt;SPAN class=""&gt;apply_changes&lt;/SPAN&gt;&lt;SPAN class=""&gt; logic that manages a "consecutive_missing_days" counter manually?&lt;/SPAN&gt;&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P class=""&gt;&lt;STRONG&gt;&lt;SPAN class=""&gt;Backfill Strategy:&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN class=""&gt; For the initial run, should I create a logic that loops through all historical dates in order within the DLT definition? Or is there a better way to "replay" history into an SCD2 table?&lt;/SPAN&gt;&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P class=""&gt;&lt;STRONG&gt;&lt;SPAN class=""&gt;Bronze Cleanup:&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN class=""&gt; Since Silver will contain the full history (SCD2), is it safe/recommended to run a daily &lt;/SPAN&gt;&lt;SPAN class=""&gt;VACUUM&lt;/SPAN&gt;&lt;SPAN class=""&gt; and &lt;/SPAN&gt;&lt;SPAN class=""&gt;DELETE&lt;/SPAN&gt;&lt;SPAN class=""&gt; on the Bronze table to keep only the last 7 days of snapshots? I want to ensure this doesn't break the DLT lineage or backfill capabilities if I ever need to recompute from &lt;/SPAN&gt;&lt;SPAN class=""&gt;recent&lt;/SPAN&gt;&lt;SPAN class=""&gt; history.&lt;/SPAN&gt;&lt;/P&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;P class=""&gt;&lt;SPAN class=""&gt;Any snippets or patterns for handling this problem on full snapshots would be greatly appreciated!&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 18 Feb 2026 23:46:05 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/architecture-advice-dlt-strategy-for-daily-snapshots-to-scd2/m-p/148751#M52964</guid>
      <dc:creator>samuelperezh</dc:creator>
      <dc:date>2026-02-18T23:46:05Z</dc:date>
    </item>
    <item>
      <title>Re: Architecture Advice: DLT Strategy for Daily Snapshots to SCD2 with "Grace Period" Dele</title>
      <link>https://community.databricks.com/t5/data-engineering/architecture-advice-dlt-strategy-for-daily-snapshots-to-scd2/m-p/149280#M53058</link>
      <description>&lt;DIV style="font-family: sans-serif; line-height: 1.6; color: #333;"&gt;
&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/216744"&gt;@samuelperezh&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Pattern for Grace Period&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;The &lt;CODE&gt;auto_cdc_from_snapshot&lt;/CODE&gt; &lt;CODE&gt;SCD2&lt;/CODE&gt; should be able to naturally take care of your scenario. Let's walk through an example:&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;1. Initial load with 2 products on 25/02/2026:&lt;/STRONG&gt;&lt;/P&gt;
&lt;TABLE style="width: 100%; border-collapse: collapse; table-layout: fixed; text-align: left; margin-bottom: 20px;" border="1"&gt;
&lt;THEAD&gt;
&lt;TR style="background-color: #f2f2f2; font-size: 1.1em;"&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;product_id&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;name&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;price&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;/TR&gt;
&lt;/THEAD&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD style="padding: 8px;"&gt;1&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;foo&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;10&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD style="padding: 8px;"&gt;2&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;bar&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;20&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD style="padding: 8px;"&gt;3&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;baz&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;30&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;P&gt;The target of the &lt;CODE&gt;auto_cdc_from_snapshot&lt;/CODE&gt; &lt;CODE&gt;SCD2&lt;/CODE&gt; will be:&lt;/P&gt;
&lt;TABLE style="width: 100%; border-collapse: collapse; table-layout: fixed; text-align: left; margin-bottom: 20px;" border="1"&gt;
&lt;THEAD&gt;
&lt;TR style="background-color: #f2f2f2; font-size: 1.1em;"&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;product_id&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;name&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;price&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;__START_AT&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;__END_AT&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;/TR&gt;
&lt;/THEAD&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD style="padding: 8px;"&gt;1&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;foo&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;10&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;25/02/2026&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;NULL&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD style="padding: 8px;"&gt;2&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;bar&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;20&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;25/02/2026&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;NULL&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD style="padding: 8px;"&gt;3&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;baz&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;30&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;25/02/2026&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;NULL&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;P&gt;&lt;STRONG&gt;2. Assume that &lt;CODE&gt;bar&lt;/CODE&gt; and &lt;CODE&gt;baz&lt;/CODE&gt; disappear on the load of 26/02/2026:&lt;/STRONG&gt;&lt;/P&gt;
&lt;TABLE style="width: 100%; border-collapse: collapse; table-layout: fixed; text-align: left; margin-bottom: 20px;" border="1"&gt;
&lt;THEAD&gt;
&lt;TR style="background-color: #f2f2f2; font-size: 1.1em;"&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;product_id&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;name&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;price&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;/TR&gt;
&lt;/THEAD&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD style="padding: 8px;"&gt;1&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;foo&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;10&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;P&gt;The target of &lt;CODE&gt;auto_cdc_from_snapshot&lt;/CODE&gt; will be updated to:&lt;/P&gt;
&lt;TABLE style="width: 100%; border-collapse: collapse; table-layout: fixed; text-align: left; margin-bottom: 20px;" border="1"&gt;
&lt;THEAD&gt;
&lt;TR style="background-color: #f2f2f2; font-size: 1.1em;"&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;product_id&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;name&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;price&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;__START_AT&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;__END_AT&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;/TR&gt;
&lt;/THEAD&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD style="padding: 8px;"&gt;1&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;foo&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;10&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;25/02/2026&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;NULL&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD style="padding: 8px;"&gt;2&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;bar&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;20&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;25/02/2026&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;26/02/2026&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD style="padding: 8px;"&gt;3&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;baz&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;30&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;25/02/2026&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;26/02/2026&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;P&gt;&lt;STRONG&gt;3. The &lt;CODE&gt;bar&lt;/CODE&gt; reappears the next day, on 27/02/2026:&lt;/STRONG&gt;&lt;/P&gt;
&lt;TABLE style="width: 100%; border-collapse: collapse; table-layout: fixed; text-align: left; margin-bottom: 20px;" border="1"&gt;
&lt;THEAD&gt;
&lt;TR style="background-color: #f2f2f2; font-size: 1.1em;"&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;product_id&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;name&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;price&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;/TR&gt;
&lt;/THEAD&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD style="padding: 8px;"&gt;1&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;foo&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;10&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD style="padding: 8px;"&gt;2&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;bar&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;20&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;P&gt;The target &lt;CODE&gt;CDC&lt;/CODE&gt; table now reflects the reappearance:&lt;/P&gt;
&lt;TABLE style="width: 100%; border-collapse: collapse; table-layout: fixed; text-align: left; margin-bottom: 20px;" border="1"&gt;
&lt;THEAD&gt;
&lt;TR style="background-color: #f2f2f2; font-size: 1.1em;"&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;product_id&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;name&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;price&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;__START_AT&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;__END_AT&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;/TR&gt;
&lt;/THEAD&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD style="padding: 8px;"&gt;1&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;foo&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;10&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;25/02/2026&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;NULL&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD style="padding: 8px;"&gt;2&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;bar&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;20&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;25/02/2026&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;26/02/2026&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD style="padding: 8px;"&gt;2&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;bar&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;20&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;27/02/2026&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;NULL&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD style="padding: 8px;"&gt;3&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;baz&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;30&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;25/02/2026&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;26/02/2026&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;P&gt;The wrongly deleted data will appear as a duplicate which needs to be handled in the downstream:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;If the new &lt;CODE&gt;__START_AT&lt;/CODE&gt; is earlier than &lt;CODE&gt;previous __END_AT + 3 days&lt;/CODE&gt;, we keep the record. Otherwise, it will be ignored.&lt;/LI&gt;
&lt;LI&gt;For each identical row, we retrieve only the &lt;CODE&gt;newest&lt;/CODE&gt; record.&lt;/LI&gt;
&lt;/UL&gt;
&lt;PRE style="background-color: #2d2d2d; color: #ccc; padding: 15px; border-radius: 5px; overflow-x: auto;"&gt;&lt;CODE&gt;CREATE MATERIALIZED VIEW target_auto_cdc_clean AS
WITH filtered AS (
  SELECT *,
    LEAD(__END_AT) OVER (PARTITION BY name, age ORDER BY __END_AT DESC NULLS FIRST) AS prev_end_at 
  FROM names_cdc
),
valid AS (
  SELECT *
  FROM filtered
  WHERE prev_end_at IS NULL 
     OR (
          prev_end_at IS NOT NULL 
          AND __START_AT &amp;lt;= prev_end_at + INTERVAL 3 DAYS 
        )
),
ranked AS (
  SELECT *, row_number() OVER (PARTITION BY name, age ORDER BY __START_AT DESC) AS rn 
  FROM valid
)
SELECT * FROM ranked WHERE rn = 1&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;This will return:&lt;/P&gt;
&lt;TABLE style="width: 100%; border-collapse: collapse; table-layout: fixed; text-align: left; margin-bottom: 20px;" border="1"&gt;
&lt;THEAD&gt;
&lt;TR style="background-color: #f2f2f2; font-size: 1.1em;"&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;product_id&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;name&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;price&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;__START_AT&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;TH style="padding: 10px;"&gt;&lt;FONT size="3"&gt;&lt;CODE&gt;__END_AT&lt;/CODE&gt;&lt;/FONT&gt;&lt;/TH&gt;
&lt;/TR&gt;
&lt;/THEAD&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD style="padding: 8px;"&gt;1&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;foo&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;10&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;25/02/2026&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;NULL&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD style="padding: 8px;"&gt;2&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;bar&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;20&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;27/02/2026&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;NULL&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD style="padding: 8px;"&gt;3&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;baz&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;30&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;25/02/2026&lt;/TD&gt;
&lt;TD style="padding: 8px;"&gt;26/02/2026&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;P&gt;Then, to apply the grace logic:&lt;/P&gt;
&lt;PRE style="background-color: #2d2d2d; color: #ccc; padding: 15px; border-radius: 5px; overflow-x: auto;"&gt;&lt;CODE&gt;SELECT *, 
    CASE 
        WHEN __END_AT &amp;gt; current_date() + INTERVAL 3 DAYS THEN TRUE
        ELSE FALSE 
    END as is_sold,
    CASE
        WHEN __END_AT &amp;gt; current_date() + INTERVAL 3 DAYS THEN __END_AT
        ELSE NULL END as sold_date
FROM target_auto_cdc_clean&lt;/CODE&gt;&lt;/PRE&gt;
&lt;/DIV&gt;
&lt;DIV&gt;Feel free to double check the queries to make sure they follow the business logic.&lt;/DIV&gt;
&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;DIV&gt;2. &lt;STRONG&gt;Backfill strategy.&amp;nbsp;&lt;/STRONG&gt;You can leverage the&amp;nbsp;&lt;SPAN&gt;&lt;CODE&gt;next_snapshot_and_version&lt;/CODE&gt; function to loop over the past dates to perform &lt;A href="https://docs.databricks.com/aws/en/ldp/cdc#example-historical-snapshot-processing" target="_self"&gt;historical backfilling&lt;/A&gt;. However, this would be an expensive operation if you have a lot of history. There is a tradeoff between the history depth in the SCD2 table vs time and cost spent to build it.&amp;nbsp;&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;DIV&gt;&lt;SPAN&gt;3. &lt;STRONG&gt;&lt;SPAN class=""&gt;Bronze Cleanup.&amp;nbsp;&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;SPAN class=""&gt;Be mindful that, if you delete the data from the bronze tables, you won't be able to recompute the whole history. For example, if you hit &lt;CODE&gt;Run pipeline with full table refresh&lt;/CODE&gt;, it will drop the AUTO CDC target table and start from scratch. If only last 7 days of data are present in the bronze layer, the new fully refreshed AUTO CDC target will contain only the last 7 days. You can m&lt;/SPAN&gt;&lt;SPAN class=""&gt;ove the old bronze data into a cold storage to reduce costs. Retrieve it whenever a full refresh is needed. This is again a tradeoff between cost and history depth. You can also prevent full table refresh at all by&amp;nbsp;setting &lt;CODE data-index-in-node="73" data-path-to-node="10,0"&gt;pipelines.reset.allowed = false&lt;/CODE&gt; in the&amp;nbsp;&lt;A href="https://docs.databricks.com/aws/en/ldp/properties#pipeline-table-properties" target="_self"&gt;pipeline&lt;/A&gt;.&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;DIV&gt;&lt;SPAN class=""&gt;Hope it helps.&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;DIV&gt;&lt;SPAN class=""&gt;Best regards,&lt;/SPAN&gt;&lt;/DIV&gt;</description>
      <pubDate>Wed, 25 Feb 2026 11:44:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/architecture-advice-dlt-strategy-for-daily-snapshots-to-scd2/m-p/149280#M53058</guid>
      <dc:creator>aleksandra_ch</dc:creator>
      <dc:date>2026-02-25T11:44:45Z</dc:date>
    </item>
    <item>
      <title>Hi @samuelperezh, Building on @aleksandra_ch's reply, I w...</title>
      <link>https://community.databricks.com/t5/data-engineering/architecture-advice-dlt-strategy-for-daily-snapshots-to-scd2/m-p/150318#M53354</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/216744"&gt;@samuelperezh&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;Building on &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/102072"&gt;@aleksandra_ch&lt;/a&gt;'s reply, I wanted to add some additional detail around each of your three questions, especially around the grace period implementation and the backfill strategy.&lt;/P&gt;
&lt;P&gt;1. GRACE PERIOD PATTERN&lt;/P&gt;
&lt;P&gt;As aleksandra_ch noted, create_auto_cdc_from_snapshot_flow with stored_as_scd_type=2 is the right starting point. The SCD2 target table automatically sets __END_AT when a product disappears from a snapshot, then creates a new active row if it reappears. This gives you the raw history you need.&lt;/P&gt;
&lt;P&gt;The grace period logic is best handled as a downstream materialized view in your pipeline. The idea is: let the AUTO CDC target capture every disappearance and reappearance faithfully, then apply your 3-day grace window in a clean-up layer on top.&lt;/P&gt;
&lt;P&gt;Here is a corrected and more complete pattern. The key insight is that you want to "stitch together" consecutive active periods for the same product_id when the gap between them is 3 days or fewer:&lt;/P&gt;
&lt;PRE&gt;CREATE MATERIALIZED VIEW inventory_scd2_with_grace AS
WITH ordered AS (
SELECT
  *,
  LAG(__END_AT) OVER (
    PARTITION BY product_id
    ORDER BY __START_AT
  ) AS prev_end_at
FROM inventory_scd2_raw
),
stitched AS (
SELECT
  *,
  CASE
    WHEN prev_end_at IS NOT NULL
      AND __START_AT &amp;lt;= prev_end_at + INTERVAL 3 DAYS
    THEN 0
    ELSE 1
  END AS new_group_flag
FROM ordered
),
grouped AS (
SELECT
  *,
  SUM(new_group_flag) OVER (
    PARTITION BY product_id
    ORDER BY __START_AT
    ROWS UNBOUNDED PRECEDING
  ) AS group_id
FROM stitched
)
SELECT
product_id,
-- take attributes from the latest row in each group
LAST_VALUE(name) OVER (
  PARTITION BY product_id, group_id
  ORDER BY __START_AT
  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS name,
LAST_VALUE(price) OVER (
  PARTITION BY product_id, group_id
  ORDER BY __START_AT
  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS price,
MIN(__START_AT) OVER (
  PARTITION BY product_id, group_id
) AS valid_from,
MAX(__END_AT) OVER (
  PARTITION BY product_id, group_id
) AS valid_to
FROM grouped&lt;/PRE&gt;
&lt;P&gt;Then to derive is_sold and sold_date:&lt;/P&gt;
&lt;PRE&gt;CREATE MATERIALIZED VIEW inventory_sold_status AS
SELECT DISTINCT
product_id,
name,
price,
valid_from,
valid_to,
CASE
  WHEN valid_to IS NOT NULL
    AND valid_to &amp;lt; current_date() - INTERVAL 3 DAYS
  THEN TRUE
  ELSE FALSE
END AS is_sold,
CASE
  WHEN valid_to IS NOT NULL
    AND valid_to &amp;lt; current_date() - INTERVAL 3 DAYS
  THEN valid_to
  ELSE NULL
END AS sold_date
FROM inventory_scd2_with_grace&lt;/PRE&gt;
&lt;P&gt;The logic here: a product is considered sold only when it has been absent for more than 3 days (valid_to is more than 3 days in the past). Products that disappeared recently (within the grace window) are not yet marked as sold.&lt;/P&gt;
&lt;P&gt;An alternative approach that avoids the downstream stitching entirely is to "carry forward" missing rows in your snapshot before passing it to create_auto_cdc_from_snapshot_flow. You would create a view that unions the current snapshot with products from the prior 3 snapshots that are not in the current one. This way the AUTO CDC engine never sees the temporary disappearances at all. The tradeoff is that it adds complexity to your source definition but simplifies the downstream.&lt;/P&gt;
&lt;P&gt;2. BACKFILL STRATEGY&lt;/P&gt;
&lt;P&gt;For the historical backfill (Sept 2025 to today), use the next_snapshot_and_version callback with create_auto_cdc_from_snapshot_flow. This is specifically designed for replaying historical snapshots in order. Here is a pattern:&lt;/P&gt;
&lt;PRE&gt;import dlt
from pyspark.sql.functions import *

def next_snapshot_and_version(latest_version):
  from datetime import date, timedelta

  start_date = date(2025, 9, 1)

  if latest_version is None:
      next_date = start_date
  else:
      next_date = date.fromisoformat(str(latest_version)) + timedelta(days=1)

  if next_date &amp;gt; date.today():
      return None

  snapshot_path = f"s3://your-bucket/bronze/inventory/{next_date.isoformat()}/"

  try:
      df = spark.read.parquet(snapshot_path)
      return (df, next_date.isoformat())
  except Exception:
      # Skip dates with no snapshot file
      return next_snapshot_and_version(next_date.isoformat())

dlt.create_auto_cdc_from_snapshot_flow(
  target="inventory_scd2_raw",
  source=next_snapshot_and_version,
  keys=["product_id"],
  stored_as_scd_type=2,
  track_history_column_list=["name", "price", "status"]
)&lt;/PRE&gt;
&lt;P&gt;A few tips for the backfill:&lt;BR /&gt;
- Use track_history_column_list to limit which column changes generate new history rows. This keeps the SCD2 table from growing unnecessarily large.&lt;BR /&gt;
- The pipeline will process one snapshot per update cycle. For ~180 historical dates, expect multiple pipeline runs. You can set the pipeline to run in triggered mode and schedule it to run repeatedly until caught up.&lt;BR /&gt;
- After the backfill is complete, switch the source to a periodic snapshot (a live view of your current bronze table) for daily incremental processing.&lt;/P&gt;
&lt;P&gt;Documentation reference for historical snapshot ingestion:&lt;BR /&gt;
&lt;A href="https://docs.databricks.com/en/dlt/cdc.html" target="_blank"&gt;https://docs.databricks.com/en/dlt/cdc.html&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;3. BRONZE CLEANUP&lt;/P&gt;
&lt;P&gt;You can safely reduce Bronze retention once the SCD2 Silver layer is built, but there are important caveats:&lt;/P&gt;
&lt;P&gt;- If you do a full refresh of the Lakeflow Spark Declarative Pipeline (SDP), it drops the AUTO CDC target table and rebuilds from scratch. If Bronze only has 7 days of data, you lose all history prior to that.&lt;BR /&gt;
- To prevent accidental full refreshes, set the pipeline configuration: pipelines.reset.allowed = false. This blocks the "Full refresh" button.&lt;BR /&gt;
- Consider moving older Bronze snapshots to a cold storage tier (S3 Glacier, Azure Cool/Archive) rather than deleting them. This gives you a safety net for reprocessing at lower cost.&lt;BR /&gt;
- VACUUM on managed tables respects the delta.deletedFileRetentionDuration setting (default 7 days). You can DELETE rows older than 7 days from your Bronze table and then VACUUM to reclaim space.&lt;/P&gt;
&lt;P&gt;One more note: since you mentioned the product name changed, the feature formerly known as DLT is now called Lakeflow Spark Declarative Pipelines (SDP). The APIs (dlt.create_auto_cdc_from_snapshot_flow, etc.) remain the same, just the product branding has been updated.&lt;/P&gt;
&lt;P&gt;Documentation references:&lt;BR /&gt;
- AUTO CDC from snapshot: &lt;A href="https://docs.databricks.com/en/dlt/cdc.html" target="_blank"&gt;https://docs.databricks.com/en/dlt/cdc.html&lt;/A&gt;&lt;BR /&gt;
- Pipeline configuration: &lt;A href="https://docs.databricks.com/en/dlt/configure-pipeline.html" target="_blank"&gt;https://docs.databricks.com/en/dlt/configure-pipeline.html&lt;/A&gt;&lt;BR /&gt;
- VACUUM: &lt;A href="https://docs.databricks.com/en/sql/language-manual/delta-vacuum.html" target="_blank"&gt;https://docs.databricks.com/en/sql/language-manual/delta-vacuum.html&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;* This reply used an agent system I built to research and draft this response based on the wide set of documentation I have available and previous memory. I personally review the draft for any obvious issues and for monitoring system reliability and update it when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand new features.&lt;/P&gt;
&lt;P&gt;If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.&lt;/P&gt;</description>
      <pubDate>Mon, 09 Mar 2026 05:02:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/architecture-advice-dlt-strategy-for-daily-snapshots-to-scd2/m-p/150318#M53354</guid>
      <dc:creator>SteveOstrowski</dc:creator>
      <dc:date>2026-03-09T05:02:35Z</dc:date>
    </item>
  </channel>
</rss>

