Options
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
02-25-2026 03:44 AM
Hi @samuelperezh ,
Pattern for Grace Period
The auto_cdc_from_snapshot SCD2 should be able to naturally take care of your scenario. Let's walk through an example:
1. Initial load with 2 products on 25/02/2026:
product_id |
name |
price |
|---|---|---|
| 1 | foo | 10 |
| 2 | bar | 20 |
| 3 | baz | 30 |
The target of the auto_cdc_from_snapshot SCD2 will be:
product_id |
name |
price |
__START_AT |
__END_AT |
|---|---|---|---|---|
| 1 | foo | 10 | 25/02/2026 | NULL |
| 2 | bar | 20 | 25/02/2026 | NULL |
| 3 | baz | 30 | 25/02/2026 | NULL |
2. Assume that bar and baz disappear on the load of 26/02/2026:
product_id |
name |
price |
|---|---|---|
| 1 | foo | 10 |
The target of auto_cdc_from_snapshot will be updated to:
product_id |
name |
price |
__START_AT |
__END_AT |
|---|---|---|---|---|
| 1 | foo | 10 | 25/02/2026 | NULL |
| 2 | bar | 20 | 25/02/2026 | 26/02/2026 |
| 3 | baz | 30 | 25/02/2026 | 26/02/2026 |
3. The bar reappears the next day, on 27/02/2026:
product_id |
name |
price |
|---|---|---|
| 1 | foo | 10 |
| 2 | bar | 20 |
The target CDC table now reflects the reappearance:
product_id |
name |
price |
__START_AT |
__END_AT |
|---|---|---|---|---|
| 1 | foo | 10 | 25/02/2026 | NULL |
| 2 | bar | 20 | 25/02/2026 | 26/02/2026 |
| 2 | bar | 20 | 27/02/2026 | NULL |
| 3 | baz | 30 | 25/02/2026 | 26/02/2026 |
The wrongly deleted data will appear as a duplicate which needs to be handled in the downstream:
- If the new
__START_ATis earlier thanprevious __END_AT + 3 days, we keep the record. Otherwise, it will be ignored. - For each identical row, we retrieve only the
newestrecord.
CREATE MATERIALIZED VIEW target_auto_cdc_clean AS
WITH filtered AS (
SELECT *,
LEAD(__END_AT) OVER (PARTITION BY name, age ORDER BY __END_AT DESC NULLS FIRST) AS prev_end_at
FROM names_cdc
),
valid AS (
SELECT *
FROM filtered
WHERE prev_end_at IS NULL
OR (
prev_end_at IS NOT NULL
AND __START_AT <= prev_end_at + INTERVAL 3 DAYS
)
),
ranked AS (
SELECT *, row_number() OVER (PARTITION BY name, age ORDER BY __START_AT DESC) AS rn
FROM valid
)
SELECT * FROM ranked WHERE rn = 1
This will return:
product_id |
name |
price |
__START_AT |
__END_AT |
|---|---|---|---|---|
| 1 | foo | 10 | 25/02/2026 | NULL |
| 2 | bar | 20 | 27/02/2026 | NULL |
| 3 | baz | 30 | 25/02/2026 | 26/02/2026 |
Then, to apply the grace logic:
SELECT *,
CASE
WHEN __END_AT > current_date() + INTERVAL 3 DAYS THEN TRUE
ELSE FALSE
END as is_sold,
CASE
WHEN __END_AT > current_date() + INTERVAL 3 DAYS THEN __END_AT
ELSE NULL END as sold_date
FROM target_auto_cdc_clean
Feel free to double check the queries to make sure they follow the business logic.
2. Backfill strategy. You can leverage the
next_snapshot_and_version function to loop over the past dates to perform historical backfilling. However, this would be an expensive operation if you have a lot of history. There is a tradeoff between the history depth in the SCD2 table vs time and cost spent to build it. 3. Bronze Cleanup. Be mindful that, if you delete the data from the bronze tables, you won't be able to recompute the whole history. For example, if you hit
Run pipeline with full table refresh, it will drop the AUTO CDC target table and start from scratch. If only last 7 days of data are present in the bronze layer, the new fully refreshed AUTO CDC target will contain only the last 7 days. You can move the old bronze data into a cold storage to reduce costs. Retrieve it whenever a full refresh is needed. This is again a tradeoff between cost and history depth. You can also prevent full table refresh at all by setting pipelines.reset.allowed = false in the pipeline.Hope it helps.
Best regards,