Architecture Advice: DLT Strategy for Daily Snapshots to SCD2 with "Grace Period" Deletes
02-18-2026 03:46 PM
Hi everyone,
I’m looking for architectural advice on building a Silver layer in DLT. I am dealing with inventory data and need to handle historical tracking, "sold" logic based on disappearance, and storage cost optimization.
Here's the situation:
Source (Bronze): I receive a full snapshot of the inventory every day (~60GB/day).
History: I have raw snapshots sitting in object storage starting from September 2025 (it's already at around 7TB).
Goal (Silver): I need an SCD Type 2 table to track changes in attributes (price, status) and determine when a product is sold.
These are the challenges I'm having:
"Sold" Logic with Grace Period:
Since the source is a full snapshot, a missing product id implies it was sold. However, data quality issues mean products sometimes disappear for 1-2 days and then reappear, so I need to implement a 3-day grace period:
If a product id is missing for >3 consecutive days, mark it as is_sold = True and set sold_date to the first day it went missing.
If it reappears within 3 days, it should remain active (ignore the gap).
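To make the grace-period rule concrete, here is the counter logic I have in mind, as a minimal framework-agnostic sketch (plain Python; the function name and shape are mine, not any DLT API):

```python
from datetime import date, timedelta

GRACE_DAYS = 3  # grace period before a missing product counts as sold

def evaluate_product(last_seen: date, today: date, grace_days: int = GRACE_DAYS):
    """Decide a product's status from the last snapshot date it appeared in.

    Returns (is_sold, sold_date):
      - missing for <= grace_days consecutive days -> still active
      - missing for  > grace_days consecutive days -> sold, with sold_date
        set to the first day it went missing (last_seen + 1 day)
    """
    missing_days = (today - last_seen).days
    if missing_days > grace_days:
        return True, last_seen + timedelta(days=1)
    return False, None
```

For example, a product last seen on Sept 1 and still missing on Sept 5 (4 consecutive missing days) would get is_sold = True with sold_date = Sept 2, while one missing only until Sept 3 stays active.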
Backfill + Incremental:
I need to process the history (Sept 2025 to today) sequentially to build the SCD2 history correctly, and then switch to daily incremental processing for new snapshots.
Storage Costs:
My cloud storage costs are high ($300+/mo) because I am retaining daily 60GB full snapshots in Bronze indefinitely. Once the data is processed into the SCD2 Silver layer, I want to aggressively clean up the Bronze table/files and keep only 7 days.
I am looking at using dlt.create_auto_cdc_from_snapshot_flow. However, I'm unsure how to inject the grace-period logic into the native CDC flow, since the native function marks a delete immediately when a key disappears from the snapshot.
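For context, this is roughly the pre-processing I'm imagining for the "manipulate the snapshot" option: before handing each daily frame to the CDC engine, re-inject rows that have been missing for at most 3 days, so the engine only sees the disappearance once the grace period has truly expired. A toy sketch with plain dicts (the real thing would be a PySpark join against a state table; all names are mine):

```python
def carry_forward(snapshot: dict, state: dict, grace_days: int = 3):
    """Augment today's snapshot with recently-missing rows.

    snapshot: {product_id: row} from today's full extract.
    state:    {product_id: (row, consecutive_missing_days)} from yesterday.
    Returns (augmented_snapshot, new_state). Rows missing for more than
    grace_days are NOT carried forward, so the CDC engine finally sees
    them disappear and closes out the SCD2 record.
    """
    augmented = dict(snapshot)
    new_state = {pid: (row, 0) for pid, row in snapshot.items()}
    for pid, (row, missing) in state.items():
        if pid in snapshot:
            continue  # reappeared within the grace window: gap is ignored
        if missing + 1 <= grace_days:
            augmented[pid] = row              # carry the stale row forward
            new_state[pid] = (row, missing + 1)
        # else: drop it -> absent from the frame -> CDC emits the delete
    return augmented, new_state
```

One caveat I already see with this pattern: the delete reaches the CDC engine 3 days late, so the end date it records would need a post-hoc adjustment (subtract the grace period) to satisfy my "sold_date = first missing day" requirement.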
My Questions:
Pattern for Grace Period: Is it better to stick with create_auto_cdc_from_snapshot_flow and manipulate the snapshot frame before passing it to the CDC engine (e.g., carrying forward missing rows for up to 3 days)? Or should I write custom apply_changes logic that manages a "consecutive_missing_days" counter manually?
Backfill Strategy: For the initial run, should I create a logic that loops through all historical dates in order within the DLT definition? Or is there a better way to "replay" history into an SCD2 table?
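For question 2, the naive replay loop I have in mind would just iterate the historical snapshot dates in order and feed each one, with a monotonically increasing version number, to the flow. Something like this date iterator (stdlib only; the path layout is a placeholder for my actual Bronze layout):

```python
from datetime import date, timedelta

def snapshot_versions(start: date, end: date, root: str = "/bronze"):
    """Yield (version, path) pairs for each daily snapshot, oldest first.

    The integer version (days since `start`) increases monotonically,
    which is what snapshot-based CDC flows need to replay history in order.
    """
    day = start
    while day <= end:
        yield (day - start).days, f"{root}/snapshot_date={day.isoformat()}"
        day += timedelta(days=1)
```

My worry is whether driving this loop inside a DLT pipeline definition is an anti-pattern versus letting the framework pull versions itself.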
Bronze Cleanup: Since Silver will contain the full history (SCD2), is it safe/recommended to run a daily VACUUM and DELETE on the Bronze table to keep only the last 7 days of snapshots? I want to ensure this doesn't break the DLT lineage or backfill capabilities if I ever need to recompute from recent history.
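To be concrete about the cleanup I'm proposing in question 3: each day, select every snapshot partition older than the 7-day window, DELETE those partitions, then VACUUM. A sketch of just the selection logic (plain Python; the actual DELETE and VACUUM would run as Spark SQL against the Bronze table):

```python
from datetime import date, timedelta

def partitions_to_delete(partition_dates, today: date, keep_days: int = 7):
    """Return the Bronze snapshot dates outside the retention window.

    partition_dates: iterable of date objects, one per daily snapshot.
    Anything strictly older than (today - keep_days) is a candidate for
    DELETE + VACUUM; the most recent keep_days of snapshots survive.
    """
    cutoff = today - timedelta(days=keep_days)
    return sorted(d for d in partition_dates if d < cutoff)
```

What I can't tell is whether physically removing those files via VACUUM breaks anything DLT relies on for lineage or reprocessing.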
Any snippets or patterns for handling this problem on full snapshots would be greatly appreciated!