aleksandra_ch
Databricks Employee

Hi @samuelperezh ,

Pattern for Grace Period

The auto_cdc_from_snapshot API with SCD type 2 should naturally take care of your scenario. Let's walk through an example:

1. Initial load with 3 products on 25/02/2026:

product_id  name  price
1           foo   10
2           bar   20
3           baz   30

The target of the auto_cdc_from_snapshot SCD2 will be:

product_id  name  price  __START_AT  __END_AT
1           foo   10     25/02/2026  NULL
2           bar   20     25/02/2026  NULL
3           baz   30     25/02/2026  NULL

2. Assume that bar and baz disappear in the snapshot of 26/02/2026:

product_id  name  price
1           foo   10

The target of auto_cdc_from_snapshot will be updated to:

product_id  name  price  __START_AT  __END_AT
1           foo   10     25/02/2026  NULL
2           bar   20     25/02/2026  26/02/2026
3           baz   30     25/02/2026  26/02/2026

3. bar reappears the next day, on 27/02/2026:

product_id  name  price
1           foo   10
2           bar   20

The target CDC table now reflects the reappearance:

product_id  name  price  __START_AT  __END_AT
1           foo   10     25/02/2026  NULL
2           bar   20     25/02/2026  26/02/2026
2           bar   20     27/02/2026  NULL
3           baz   30     25/02/2026  26/02/2026

The wrongly deleted record now appears as two versions of the same row, which needs to be handled downstream:

  • If the new __START_AT falls within 3 days of the previous version's __END_AT (the grace period), we keep the record; otherwise it is ignored.
  • For each product, we retrieve only the newest surviving record.
CREATE MATERIALIZED VIEW target_auto_cdc_clean AS
WITH filtered AS (
  SELECT *,
    -- __END_AT of the previous version of the same product (NULL for the first version)
    LEAD(__END_AT) OVER (PARTITION BY product_id ORDER BY __START_AT DESC) AS prev_end_at
  FROM target_auto_cdc  -- the SCD2 target table of auto_cdc_from_snapshot
),
valid AS (
  SELECT *
  FROM filtered
  WHERE prev_end_at IS NULL
     OR __START_AT <= prev_end_at + INTERVAL 3 DAYS
),
ranked AS (
  SELECT *, row_number() OVER (PARTITION BY product_id ORDER BY __START_AT DESC) AS rn
  FROM valid
)
SELECT * FROM ranked WHERE rn = 1

This will return:

product_id  name  price  __START_AT  __END_AT
1           foo   10     25/02/2026  NULL
2           bar   20     27/02/2026  NULL
3           baz   30     25/02/2026  26/02/2026

Then, to apply the grace-period logic (a record counts as sold only once it has been closed for more than 3 days):

SELECT *,
    CASE
        WHEN __END_AT + INTERVAL 3 DAYS < current_date() THEN TRUE
        ELSE FALSE
    END AS is_sold,
    CASE
        WHEN __END_AT + INTERVAL 3 DAYS < current_date() THEN __END_AT
        ELSE NULL
    END AS sold_date
FROM target_auto_cdc_clean
Feel free to double-check the queries to make sure they follow your business logic.
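One way to double-check the window logic is to simulate it in plain Python on the walkthrough rows. This is illustrative only, not pipeline code; the clean function mirrors the dedup-plus-grace filter of the materialized view above:

```python
from datetime import date, timedelta

GRACE = timedelta(days=3)

# SCD2 versions from the walkthrough:
# (product_id, name, price, __START_AT, __END_AT)
rows = [
    (1, "foo", 10, date(2026, 2, 25), None),
    (2, "bar", 20, date(2026, 2, 25), date(2026, 2, 26)),
    (2, "bar", 20, date(2026, 2, 27), None),
    (3, "baz", 30, date(2026, 2, 25), date(2026, 2, 26)),
]

def clean(rows, grace=GRACE):
    by_key = {}
    for r in rows:
        by_key.setdefault(r[0], []).append(r)
    result = []
    for versions in by_key.values():
        versions.sort(key=lambda r: r[3])  # order versions by __START_AT
        kept = []
        for i, r in enumerate(versions):
            prev_end = versions[i - 1][4] if i > 0 else None
            # keep the first version, or one that reappeared within the grace window
            if prev_end is None or r[3] <= prev_end + grace:
                kept.append(r)
        result.append(kept[-1])  # newest surviving version per product
    return sorted(result, key=lambda r: r[0])
```

Running clean(rows) reproduces the three rows shown above: foo and the reappeared bar stay current, and baz keeps its closed version.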
 
2. Backfill strategy. You can leverage the next_snapshot_and_version function to loop over past dates and perform historical backfilling. However, this can be an expensive operation if you have a lot of history; there is a tradeoff between the history depth of the SCD2 table and the time and cost spent building it.
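As a sketch of that loop, a next_snapshot_and_version implementation can step day by day through the snapshot history. The path /bronze/products/&lt;date&gt;, the table name products_scd2, and the backfill window are assumptions for illustration:

```python
from datetime import date, timedelta

BACKFILL_START = date(2026, 2, 25)  # assumed oldest snapshot available
BACKFILL_END = date(2026, 2, 27)    # assumed newest snapshot available

def next_backfill_date(latest_version):
    """Map the last returned version (a day ordinal, or None on the first call)
    to the next snapshot date, or None once the backfill window is exhausted."""
    if latest_version is None:
        return BACKFILL_START
    nxt = date.fromordinal(latest_version) + timedelta(days=1)
    return nxt if nxt <= BACKFILL_END else None

def next_snapshot_and_version(latest_snapshot_version):
    # The pipeline calls this with the version it received last time.
    next_day = next_backfill_date(latest_snapshot_version)
    if next_day is None:
        return None  # no more snapshots; processing stops until the next update
    df = spark.read.format("delta").load(f"/bronze/products/{next_day.isoformat()}")
    # Use the day ordinal as a monotonically increasing version number.
    return (df, next_day.toordinal())

# Wiring it up in the pipeline definition (runs only inside a pipeline):
# import dlt
# dlt.create_streaming_table("products_scd2")
# dlt.create_auto_cdc_from_snapshot(
#     target="products_scd2",
#     source=next_snapshot_and_version,
#     keys=["product_id"],
#     stored_as_scd_type=2,
# )
```

Each call advances one day, so the pipeline replays the snapshots in order and the SCD2 target accumulates the full history.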
 
3. Bronze cleanup. Be mindful that if you delete data from the bronze tables, you won't be able to recompute the whole history. For example, if you hit Run pipeline with full table refresh, it will drop the AUTO CDC target table and rebuild it from scratch; if only the last 7 days of data are present in the bronze layer, the fully refreshed AUTO CDC target will contain only those 7 days. You can move old bronze data into cold storage to reduce costs and retrieve it whenever a full refresh is needed. This is again a tradeoff between cost and history depth. You can also prevent full table refresh entirely by setting pipelines.reset.allowed = false on the target table.
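For instance, the property can be set when declaring the target table. This fragment only runs inside a Lakeflow Declarative Pipeline, and the table name is an assumption:

```python
import dlt

# Block full table refresh for this table so its SCD2 history cannot be dropped.
# "products_scd2" is an assumed table name.
dlt.create_streaming_table(
    "products_scd2",
    table_properties={"pipelines.reset.allowed": "false"},
)
```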
 
Hope it helps.
 
Best regards,