Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

SCD-2 backfilling with streaming tables

JD18
New Contributor

Hi there,

I'm new to Databricks and am trying to build an SCD Type 2 table using the AUTO CDC approach. While creating the SCD2 table itself is quite simple, I'm unable to do a backfill.

Full context:

I have raw data (orders, customer info) from 2019 and am creating a customer dimension (dim_customer). When I first ran the pipeline, dim_customer was populated correctly, but __start_date was equal to the current_timestamp and __end_date was null (as expected). In this scenario the __start_date (the start of the effective period) should not be 2026, right? Ideally it should be when we first received an order from the customer (i.e. somewhere in 2019). As a general approach I tried to use the min(order_date) from the raw data to populate __start_date, but failed.

I tried a few approaches:

  1. Tried creating the streaming table from a notebook - failed with an error that it cannot be created in Unity Catalog.
  2. Created a separate pipeline for backfills - failed again with an error that dim_customer is managed by a separate pipeline.

So folks, I'm reaching out to understand the standard way to solve this problem. I would appreciate it if you could share any links, blog posts, or your own experience that could help me with this.

TIA,

JD.

#databricks

#dlt

#scd2
#auto_cdc

2 REPLIES

User16502773013
Databricks Employee

Hello JD,

Here is a suggested approach:

Within the single pipeline that owns dim_customer:
1. Declare the streaming table with __START_AT and __END_AT.
2. Add a backfill flow:
CREATE FLOW backfill_dim_customer AS
INSERT INTO ONCE dim_customer BY NAME
SELECT
  customer_id,
  ...,
  MIN(order_date) AS __START_AT,
  CAST(NULL AS TIMESTAMP) AS __END_AT
FROM raw_orders
GROUP BY customer_id, ...;
3. Add an AUTO CDC SCD2 flow using your CDC source and a proper SEQUENCE BY column that reflects change/effective time.
That will give you SCD2 history where the first version for each customer starts in 2019 (or whenever the first order occurred), and subsequent changes will be tracked automatically going forward.
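As a plain-Python sanity check of the backfill idea (the record shapes and the backfill_start_dates helper are illustrative, not Databricks APIs), seeding each customer's first version from MIN(order_date) yields an effective start back in 2019:

```python
from datetime import date

# Hypothetical raw orders; only the earliest order per customer
# matters for the backfill's __START_AT.
raw_orders = [
    {"customer_id": 1, "order_date": date(2019, 5, 2)},
    {"customer_id": 1, "order_date": date(2020, 1, 10)},
    {"customer_id": 2, "order_date": date(2019, 11, 30)},
]

def backfill_start_dates(orders):
    """Mimic the backfill flow's SELECT MIN(order_date) ... GROUP BY customer_id."""
    starts = {}
    for o in orders:
        cid = o["customer_id"]
        starts[cid] = min(starts.get(cid, o["order_date"]), o["order_date"])
    return starts

print(backfill_start_dates(raw_orders))
# {1: datetime.date(2019, 5, 2), 2: datetime.date(2019, 11, 30)}
```

The AUTO CDC flow in step 3 then tracks changes on top of these seeded versions.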

SteveOstrowski
Databricks Employee

Hi @JD18,

Welcome to Databricks, and thanks for raising this. SCD Type 2 backfilling with streaming tables is a common need, and the good news is that the AUTO CDC framework (formerly APPLY CHANGES INTO) has built-in capabilities to handle this -- you just need to set up your SEQUENCE BY column correctly.


THE ROOT CAUSE OF YOUR ISSUE

The reason your __START_AT values are showing 2026 (current time) instead of the historical dates from 2019 is almost certainly related to what you are using as your SEQUENCE BY column. When you define an AUTO CDC flow with STORED AS SCD TYPE 2, the framework automatically populates __START_AT and __END_AT using the values from your SEQUENCE BY column -- not current_timestamp. So if your SEQUENCE BY column contains the current processing time rather than the original business event time, that is exactly what __START_AT will reflect.

For example, if your flow looks like this:

CREATE FLOW customer_cdc AS AUTO CDC INTO dim_customer
FROM STREAM(raw_orders_cdc)
KEYS (customer_id)
SEQUENCE BY processing_timestamp
STORED AS SCD TYPE 2;

And processing_timestamp is set to when the pipeline runs, then __START_AT will be 2026. Instead, you need to use a column that reflects when the business event actually occurred.
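The effect can be illustrated outside Databricks with a small pure-Python sketch (the dict-based records and the apply_scd2 helper are illustrative stand-ins, not any Databricks API): __start_at is taken from the sequencing column of each change, never from the processing time.

```python
from datetime import datetime

def apply_scd2(changes, key, sequence_by):
    """Build SCD2 history: __start_at comes from the SEQUENCE BY
    column of each change, not from the time of processing."""
    history = {}
    for row in sorted(changes, key=lambda r: (r[key], r[sequence_by])):
        versions = history.setdefault(row[key], [])
        if versions:
            # Close the previous version at the new change's sequence value.
            versions[-1]["__end_at"] = row[sequence_by]
        versions.append({**row, "__start_at": row[sequence_by], "__end_at": None})
    return history

changes = [
    {"customer_id": 1, "address": "Old St", "order_date": datetime(2019, 3, 15)},
    {"customer_id": 1, "address": "New St", "order_date": datetime(2021, 7, 1)},
]

hist = apply_scd2(changes, key="customer_id", sequence_by="order_date")
print(hist[1][0]["__start_at"])  # 2019-03-15 00:00:00 (business event time)
print(hist[1][1]["__end_at"])    # None (current version)
```

Swap order_date for a processing timestamp and the first version would start "now" instead, which is exactly the symptom described above.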


THE FIX: USE A MEANINGFUL SEQUENCE BY COLUMN

The most straightforward solution is to use a column that represents the actual chronological ordering of changes. In your case, since you have raw order data from 2019, you likely have an order_date or a similar timestamp column. Use that as your SEQUENCE BY:

CREATE FLOW customer_cdc AS AUTO CDC INTO dim_customer
FROM STREAM(raw_orders_cdc)
KEYS (customer_id)
SEQUENCE BY order_date
COLUMNS * EXCEPT (operation)
STORED AS SCD TYPE 2;

Or in Python:

from pyspark import pipelines as dp
from pyspark.sql.functions import col

dp.create_streaming_table("dim_customer")
dp.create_auto_cdc_flow(
    target="dim_customer",
    source="raw_orders_cdc",
    keys=["customer_id"],
    sequence_by=col("order_date"),
    stored_as_scd_type="2"
)

With this setup, when the pipeline processes a record with order_date = 2019-03-15, the __START_AT will be 2019-03-15, which is exactly what you want.


RECOMMENDED BACKFILL STRATEGY: ROUTE EVERYTHING THROUGH THE AUTO CDC FLOW

The recommended approach is to structure your source so that it contains both historical and ongoing changes, then let the AUTO CDC flow process everything together:

1. Prepare your source data so that every record (historical from 2019 onward, plus any ongoing changes) has a proper sequencing timestamp reflecting when the event actually happened.

2. If you need to derive the effective date from your raw orders, create a view that computes it. For example, if you want to track the first order per customer:

CREATE OR REPLACE VIEW customer_changes AS
SELECT
  customer_id,
  customer_name,
  customer_address,
  order_date AS effective_date,
  'UPSERT' AS operation
FROM raw_orders;

3. Use this view as the source for your AUTO CDC flow with SEQUENCE BY effective_date.

4. Run a full refresh to reprocess everything from scratch:

REFRESH STREAMING TABLE dim_customer FULL

You can also trigger a full refresh from the pipeline UI by clicking the dropdown next to Start and selecting Full Refresh All.

The AUTO CDC framework automatically handles out-of-order data. Even if historical records with older timestamps arrive after newer records, the framework reorders them based on the SEQUENCE BY column and builds the correct SCD-2 history.
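The reordering behavior can be sketched in plain Python (the build_scd2 helper and record shapes are illustrative, not the actual framework): even when a 2019 record arrives after a 2023 record, ordering on the SEQUENCE BY column produces the correct timeline.

```python
from datetime import date

# Records arrive out of order: the historical 2019 row shows up last.
arriving = [
    {"customer_id": 1, "address": "Pine Rd", "effective_date": date(2023, 6, 1)},
    {"customer_id": 1, "address": "Oak Ave", "effective_date": date(2019, 3, 15)},
]

def build_scd2(rows, sequence_by="effective_date"):
    """Order by the SEQUENCE BY column first, then chain versions."""
    versions = []
    for row in sorted(rows, key=lambda r: r[sequence_by]):
        if versions:
            versions[-1]["__end_at"] = row[sequence_by]
        versions.append({**row, "__start_at": row[sequence_by], "__end_at": None})
    return versions

hist = build_scd2(arriving)
print(hist[0]["__start_at"], hist[0]["__end_at"])  # 2019-03-15 2023-06-01
print(hist[1]["__start_at"], hist[1]["__end_at"])  # 2023-06-01 None
```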


IMPORTANT CONSTRAINT: FLOW TYPE MIXING

One critical point -- a streaming table that is the target of an AUTO CDC flow can ONLY be targeted by other AUTO CDC flows. You cannot mix append flows (like INSERT INTO or INSERT INTO ONCE) with AUTO CDC flows on the same target table. This is documented here: https://docs.databricks.com/en/delta-live-tables/flows.html

This means the approach of creating a separate backfill flow using INSERT INTO ONCE on your dim_customer table will not work if dim_customer is already an AUTO CDC target. Instead, route all your data (historical and ongoing) through the AUTO CDC mechanism itself.


ALTERNATIVE: AUTO CDC FROM SNAPSHOT

If your historical data comes as point-in-time snapshots (common when migrating from source databases), you can use AUTO CDC FROM SNAPSHOT. This API compares successive snapshots to determine what changed and can output SCD Type 2 history. This is Python-only:

dp.create_auto_cdc_from_snapshot_flow(
    target="dim_customer",
    source="customer_snapshot_source",
    keys=["customer_id"],
    stored_as_scd_type=2
)

More details: https://docs.databricks.com/en/delta-live-tables/cdc.html
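A rough plain-Python picture of what snapshot comparison does (the diff_snapshots helper is a simplified stand-in, not the actual API): each pair of successive snapshots is diffed per key to emit inserts, updates, and deletes, which then drive the SCD2 history.

```python
def diff_snapshots(old, new):
    """Compare two {key: row} snapshots and emit change events,
    roughly as a snapshot-based CDC mechanism would."""
    changes = []
    for key, row in new.items():
        if key not in old:
            changes.append(("INSERT", key, row))
        elif old[key] != row:
            changes.append(("UPDATE", key, row))
    for key in old:
        if key not in new:
            changes.append(("DELETE", key, old[key]))
    return changes

snap_2019 = {1: {"name": "Ann", "city": "Oslo"}}
snap_2020 = {1: {"name": "Ann", "city": "Bergen"}, 2: {"name": "Bo", "city": "Oslo"}}

print(diff_snapshots(snap_2019, snap_2020))
# [('UPDATE', 1, {'name': 'Ann', 'city': 'Bergen'}), ('INSERT', 2, {'name': 'Bo', 'city': 'Oslo'})]
```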


TIPS FOR SEQUENCE BY

- The SEQUENCE BY column values must never be NULL.
- There should be one distinct update per key at each sequence value. If you have duplicates, add a tiebreaker using a STRUCT: SEQUENCE BY struct(order_date, order_id).
- For historical backfills, make sure your SEQUENCE BY values for historical records are earlier than your ongoing CDC events so the timeline is consistent.
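The struct tiebreaker behaves like a composite sort key; in plain Python (illustrative records, not the actual framework) a tuple plays the same role:

```python
from datetime import date

# Two changes for the same key on the same day: order_date alone
# cannot order them, so order_id breaks the tie.
changes = [
    {"order_id": 102, "order_date": date(2019, 3, 15), "status": "shipped"},
    {"order_id": 101, "order_date": date(2019, 3, 15), "status": "placed"},
]

# SEQUENCE BY struct(order_date, order_id) ~ sort key (order_date, order_id)
ordered = sorted(changes, key=lambda r: (r["order_date"], r["order_id"]))
print([r["status"] for r in ordered])  # ['placed', 'shipped']
```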


DOCUMENTATION REFERENCES

- AUTO CDC (APPLY CHANGES) for SCD Type 2: https://docs.databricks.com/en/delta-live-tables/cdc.html
- Flows (including backfill patterns and restrictions): https://docs.databricks.com/en/delta-live-tables/flows.html
- Pipeline updates and full refresh: https://docs.databricks.com/en/delta-live-tables/updates.html
- Lakeflow Spark Declarative Pipelines (SDP) overview: https://docs.databricks.com/en/delta-live-tables/index.html

Hope this helps you get your dim_customer SCD-2 backfill working. The key takeaway is: use a SEQUENCE BY column that reflects the actual business event time (like order_date), route all your data through the AUTO CDC flow, and do a full refresh to reprocess the complete history. The framework will build the correct SCD-2 timeline automatically.

* This reply used an agent system I built to research and draft this response based on the wide set of documentation I have available and previous memory. I personally review the draft for any obvious issues and for monitoring system reliability and update it when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand new features.