aleksandra_ch
Databricks Employee

Hi @samuelperezh ,

Pattern for Grace Period

The auto_cdc_from_snapshot API with SCD type 2 should naturally take care of your scenario. Let's walk through an example:

1. Initial load with 3 products on 25/02/2026:

product_id  name  price
1           foo   10
2           bar   20
3           baz   30

The target of the auto_cdc_from_snapshot SCD2 will be:

product_id  name  price  __START_AT  __END_AT
1           foo   10     25/02/2026  NULL
2           bar   20     25/02/2026  NULL
3           baz   30     25/02/2026  NULL

2. Assume that bar and baz disappear in the snapshot of 26/02/2026:

product_id  name  price
1           foo   10

The target of auto_cdc_from_snapshot will be updated to:

product_id  name  price  __START_AT  __END_AT
1           foo   10     25/02/2026  NULL
2           bar   20     25/02/2026  26/02/2026
3           baz   30     25/02/2026  26/02/2026

3. bar reappears the next day, on 27/02/2026:

product_id  name  price
1           foo   10
2           bar   20

The target CDC table now reflects the reappearance:

product_id  name  price  __START_AT  __END_AT
1           foo   10     25/02/2026  NULL
2           bar   20     25/02/2026  26/02/2026
2           bar   20     27/02/2026  NULL
3           baz   30     25/02/2026  26/02/2026

The wrongly deleted record now appears as two versions of the same row, which needs to be handled downstream:

  • If the new __START_AT falls within 3 days of the previous version's __END_AT (the grace period), we keep the record; otherwise it is ignored.
  • For each product, we retrieve only the newest surviving record.
CREATE MATERIALIZED VIEW target_auto_cdc_clean AS
WITH filtered AS (
  SELECT *,
    -- __END_AT of the previous version of the same product (NULL for the first version)
    LEAD(__END_AT) OVER (PARTITION BY product_id ORDER BY __START_AT DESC) AS prev_end_at
  FROM target_auto_cdc  -- the SCD2 target table of auto_cdc_from_snapshot
),
valid AS (
  SELECT *
  FROM filtered
  WHERE prev_end_at IS NULL
     OR __START_AT <= prev_end_at + INTERVAL 3 DAYS
),
ranked AS (
  SELECT *, row_number() OVER (PARTITION BY product_id ORDER BY __START_AT DESC) AS rn
  FROM valid
)
SELECT * FROM ranked WHERE rn = 1

This will return:

product_id  name  price  __START_AT  __END_AT
1           foo   10     25/02/2026  NULL
2           bar   20     27/02/2026  NULL
3           baz   30     25/02/2026  26/02/2026

Then, to apply the grace-period logic (a record counts as sold only once it has been closed for more than 3 days):

SELECT *,
    CASE
        WHEN __END_AT + INTERVAL 3 DAYS < current_date() THEN TRUE
        ELSE FALSE
    END AS is_sold,
    CASE
        WHEN __END_AT + INTERVAL 3 DAYS < current_date() THEN __END_AT
        ELSE NULL
    END AS sold_date
FROM target_auto_cdc_clean
Feel free to double-check the queries to make sure they follow your business logic.
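One way to double-check the window logic is to simulate it in plain Python on the walkthrough rows. This is illustrative only, not pipeline code; the clean function mirrors the dedup-plus-grace filter of the materialized view above:

```python
from datetime import date, timedelta

GRACE = timedelta(days=3)

# SCD2 versions from the walkthrough:
# (product_id, name, price, __START_AT, __END_AT)
rows = [
    (1, "foo", 10, date(2026, 2, 25), None),
    (2, "bar", 20, date(2026, 2, 25), date(2026, 2, 26)),
    (2, "bar", 20, date(2026, 2, 27), None),
    (3, "baz", 30, date(2026, 2, 25), date(2026, 2, 26)),
]

def clean(rows, grace=GRACE):
    by_key = {}
    for r in rows:
        by_key.setdefault(r[0], []).append(r)
    result = []
    for versions in by_key.values():
        versions.sort(key=lambda r: r[3])  # order versions by __START_AT
        kept = []
        for i, r in enumerate(versions):
            prev_end = versions[i - 1][4] if i > 0 else None
            # keep the first version, or one that reappeared within the grace window
            if prev_end is None or r[3] <= prev_end + grace:
                kept.append(r)
        result.append(kept[-1])  # newest surviving version per product
    return sorted(result, key=lambda r: r[0])
```

Running clean(rows) reproduces the three rows shown above: foo and the reappeared bar stay current, and baz keeps its closed version.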
 
2. Backfill strategy. You can leverage the next_snapshot_and_version function to loop over past dates and perform historical backfilling. However, this can be an expensive operation if you have a lot of history; there is a tradeoff between the history depth of the SCD2 table and the time and cost spent building it.
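As a sketch of that loop, a next_snapshot_and_version implementation can step day by day through the snapshot history. The path /bronze/products/&lt;date&gt;, the table name products_scd2, and the backfill window are assumptions for illustration:

```python
from datetime import date, timedelta

BACKFILL_START = date(2026, 2, 25)  # assumed oldest snapshot available
BACKFILL_END = date(2026, 2, 27)    # assumed newest snapshot available

def next_backfill_date(latest_version):
    """Map the last returned version (a day ordinal, or None on the first call)
    to the next snapshot date, or None once the backfill window is exhausted."""
    if latest_version is None:
        return BACKFILL_START
    nxt = date.fromordinal(latest_version) + timedelta(days=1)
    return nxt if nxt <= BACKFILL_END else None

def next_snapshot_and_version(latest_snapshot_version):
    # The pipeline calls this with the version it received last time.
    next_day = next_backfill_date(latest_snapshot_version)
    if next_day is None:
        return None  # no more snapshots; processing stops until the next update
    df = spark.read.format("delta").load(f"/bronze/products/{next_day.isoformat()}")
    # Use the day ordinal as a monotonically increasing version number.
    return (df, next_day.toordinal())

# Wiring it up in the pipeline definition (runs only inside a pipeline):
# import dlt
# dlt.create_streaming_table("products_scd2")
# dlt.create_auto_cdc_from_snapshot(
#     target="products_scd2",
#     source=next_snapshot_and_version,
#     keys=["product_id"],
#     stored_as_scd_type=2,
# )
```

Each call advances one day, so the pipeline replays the snapshots in order and the SCD2 target accumulates the full history.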
 
3. Bronze cleanup. Be mindful that if you delete data from the bronze tables, you won't be able to recompute the whole history. For example, if you hit Run pipeline with full table refresh, it will drop the AUTO CDC target table and rebuild it from scratch; if only the last 7 days of data are present in the bronze layer, the fully refreshed AUTO CDC target will contain only those 7 days. You can move old bronze data into cold storage to reduce costs and retrieve it whenever a full refresh is needed. This is again a tradeoff between cost and history depth. You can also prevent full table refresh entirely by setting pipelines.reset.allowed = false on the target table.
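For instance, the property can be set when declaring the target table. This fragment only runs inside a Lakeflow Declarative Pipeline, and the table name is an assumption:

```python
import dlt

# Block full table refresh for this table so its SCD2 history cannot be dropped.
# "products_scd2" is an assumed table name.
dlt.create_streaming_table(
    "products_scd2",
    table_properties={"pipelines.reset.allowed": "false"},
)
```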
 
Hope it helps.
 
Best regards,