Architecture Advice: DLT Strategy for Daily Snapshots to SCD2 with "Grace Period" Deletes

samuelperezh
New Contributor

Hi everyone,

I’m looking for architectural advice on building a Silver layer in DLT. I am dealing with inventory data and need to handle historical tracking, "sold" logic based on disappearance, and storage cost optimization.

Here's what the situation looks like:

  • Source (Bronze): I receive a full snapshot of the inventory every day (~60GB/day).

  • History: I have raw snapshots sitting in object storage starting from September 2025 (it's already at around 7TB).

  • Goal (Silver): I need an SCD Type 2 table to track changes in attributes (price, status) and determine when a product is sold.

These are the challenges I'm having:

  1. "Sold" Logic with Grace Period:
    Since the source is a full snapshot, a missing product id implies it was sold. However, data quality issues mean products sometimes disappear for 1-2 days and then reappear.

    • I need to implement a 3-day grace period.

    • If a product id is missing for >3 consecutive days, mark it as is_sold = True and set sold_date to the first day it went missing.

    • If it reappears within 3 days, it should remain active (ignore the gap).

  2. Backfill + Incremental:
    I need to process the history (Sept 2025 to Today) sequentially to build the SCD2 history correctly, and then switch to daily incremental processing for new snapshots.

  3. Storage Costs:
    My cloud storage costs are high ($300+/mo) because I am retaining daily 60GB full snapshots in Bronze indefinitely. Once the data is processed into the SCD2 Silver layer, I want to aggressively clean up the Bronze table/files and keep only 7 days.

I am looking at using dlt.create_auto_cdc_from_snapshot_flow. However, I'm unsure how to inject the grace-period logic into the native CDC flow, since the engine marks a key as deleted as soon as it disappears from a snapshot.
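For reference, the shape I'm experimenting with looks roughly like this. The table and function names are mine, the source-function body is a stub, and this fragment only runs inside a DLT/Lakeflow pipeline, so treat it as a sketch of the wiring rather than a working implementation:

```python
import dlt


def next_snapshot_and_version(latest_snapshot_version):
    # Return (snapshot_df, version) for the next unprocessed daily snapshot,
    # or None when caught up. This is where I'd carry forward rows that have
    # been missing for <= 3 days, so the engine doesn't emit a delete during
    # the grace window.
    return None  # stub: real implementation reads Bronze by snapshot date


dlt.create_streaming_table("silver_inventory")  # target name is mine

dlt.create_auto_cdc_from_snapshot_flow(
    target="silver_inventory",
    source=next_snapshot_and_version,
    keys=["product_id"],
    stored_as_scd_type=2,
)
```

My understanding is that the source function is also the natural place for sequential backfill, since it is called repeatedly with the last processed version, but I'd appreciate confirmation.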

My Questions:

  1. Pattern for Grace Period: Is it better to stick with create_auto_cdc_from_snapshot_flow and pre-process the snapshot frame before passing it to the CDC engine (e.g., carrying forward missing rows for up to 3 days)? Or should I write custom apply_changes logic that manages a consecutive_missing_days counter manually?

  2. Backfill Strategy: For the initial run, should I write logic inside the DLT definition that loops through all historical dates in order? Or is there a better way to "replay" history into an SCD2 table?

  3. Bronze Cleanup: Since Silver will contain the full history (SCD2), is it safe/recommended to run a daily DELETE plus VACUUM on the Bronze table to keep only the last 7 days of snapshots? I want to ensure this doesn't break the DLT lineage or backfill capabilities if I ever need to recompute from recent history.
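To make Question 2 concrete, here is a tiny pure-Python model of what I mean by "replaying" ordered snapshots into SCD2 rows (tracking a single attribute, price; dates are strings for simplicity, and deletes are left to the grace-period logic):

```python
def replay_scd2(snapshots):
    """snapshots: ordered list of (snapshot_date, {product_id: price}).

    Returns SCD2 rows as dicts with valid_from / valid_to, where
    valid_to is None for the currently open version.
    """
    history = []  # closed + open SCD2 rows, in insertion order
    current = {}  # product_id -> index of its open row in history

    for snap_date, rows in snapshots:
        for pid, price in rows.items():
            if pid in current:
                open_row = history[current[pid]]
                if open_row["price"] == price:
                    continue  # no change: keep the open row as-is
                open_row["valid_to"] = snap_date  # close the old version
            history.append({
                "product_id": pid,
                "price": price,
                "valid_from": snap_date,
                "valid_to": None,
            })
            current[pid] = len(history) - 1
    return history
```

The key property I'm trying to preserve is that snapshots must be applied in strict date order, which is why I'm asking whether a loop over historical dates is the right way to drive this inside DLT.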

Any snippets or patterns for handling this problem on full snapshots would be greatly appreciated!