Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

SCD-2 backfilling with streaming tables

JD18
New Contributor

Hi there,

I'm new to Databricks and am trying to build an SCD Type 2 table using the AUTO CDC approach. While creating the SCD2 table itself is quite simple, I'm unable to do a backfill.

Full context:

I have raw data (orders, customer info) from 2019 and am creating a customer dimension (dim_customer). When I first ran the pipeline, dim_customer was populated correctly, but __start_date was equal to the current_timestamp and __end_date was null (as expected). In this scenario the __start_date (the start of the effective period) should not be 2026, right? Ideally it should be when we first received an order from the customer (i.e. somewhere in 2019). As a general approach I tried to use the min(order_date) from the raw data to populate __start_date, but failed.

I tried a few approaches:

  1. Tried creating the streaming table from a notebook - failed with an error that it cannot be created in Unity Catalog.
  2. Created a separate pipeline for backfills - failed again with an error that dim_customer is managed by a separate pipeline.

So folks, I'm reaching out to understand the standard way to solve this problem. I would appreciate it if you could share any links, blog posts, or your own experience that could help me with this.

TIA,

JD.

#databricks

#dlt

#scd2
#auto_cdc

2 REPLIES

User16502773013
Databricks Employee

Hello JD,

Here is a suggested approach:

Within the single pipeline that owns dim_customer:
1. Declare the streaming table with __START_AT and __END_AT.
2. Add a backfill flow:
CREATE FLOW backfill_dim_customer AS
INSERT INTO ONCE dim_customer BY NAME
SELECT
  customer_id,
  ...,
  MIN(order_date) AS __START_AT,
  CAST(NULL AS TIMESTAMP) AS __END_AT
FROM raw_orders
GROUP BY customer_id, ...;
3. Add an AUTO CDC SCD2 flow using your CDC source and a proper SEQUENCE BY column that reflects change/effective time.
That will give you SCD2 history where the first version for each customer starts in 2019 (or whenever the first order occurred), and subsequent changes will be tracked automatically going forward.
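As a plain-Python sanity check of the backfill idea (the record shapes and the backfill_start_dates helper are illustrative, not Databricks APIs), seeding each customer's first version from MIN(order_date) yields an effective start back in 2019:

```python
from datetime import date

# Hypothetical raw orders; only the earliest order per customer
# matters for the backfill's __START_AT.
raw_orders = [
    {"customer_id": 1, "order_date": date(2019, 5, 2)},
    {"customer_id": 1, "order_date": date(2020, 1, 10)},
    {"customer_id": 2, "order_date": date(2019, 11, 30)},
]

def backfill_start_dates(orders):
    """Mimic the backfill flow's SELECT MIN(order_date) ... GROUP BY customer_id."""
    starts = {}
    for o in orders:
        cid = o["customer_id"]
        starts[cid] = min(starts.get(cid, o["order_date"]), o["order_date"])
    return starts

print(backfill_start_dates(raw_orders))
# {1: datetime.date(2019, 5, 2), 2: datetime.date(2019, 11, 30)}
```

The AUTO CDC flow in step 3 then tracks changes on top of these seeded versions.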

SteveOstrowski
Databricks Employee

Hi @JD18,

Welcome to Databricks, and thanks for raising this. SCD Type 2 backfilling with streaming tables is a common need, and the good news is that the AUTO CDC framework (formerly APPLY CHANGES INTO) has built-in capabilities to handle this -- you just need to set up your SEQUENCE BY column correctly.


THE ROOT CAUSE OF YOUR ISSUE

The reason your __START_AT values are showing 2026 (current time) instead of the historical dates from 2019 is almost certainly related to what you are using as your SEQUENCE BY column. When you define an AUTO CDC flow with STORED AS SCD TYPE 2, the framework automatically populates __START_AT and __END_AT using the values from your SEQUENCE BY column -- not current_timestamp. So if your SEQUENCE BY column contains the current processing time rather than the original business event time, that is exactly what __START_AT will reflect.

For example, if your flow looks like this:

CREATE FLOW customer_cdc AS AUTO CDC INTO dim_customer
FROM STREAM(raw_orders_cdc)
KEYS (customer_id)
SEQUENCE BY processing_timestamp
STORED AS SCD TYPE 2;

And processing_timestamp is set to when the pipeline runs, then __START_AT will be 2026. Instead, you need to use a column that reflects when the business event actually occurred.
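The effect can be illustrated outside Databricks with a small pure-Python sketch (the dict-based records and the apply_scd2 helper are illustrative stand-ins, not any Databricks API): __start_at is taken from the sequencing column of each change, never from the processing time.

```python
from datetime import datetime

def apply_scd2(changes, key, sequence_by):
    """Build SCD2 history: __start_at comes from the SEQUENCE BY
    column of each change, not from the time of processing."""
    history = {}
    for row in sorted(changes, key=lambda r: (r[key], r[sequence_by])):
        versions = history.setdefault(row[key], [])
        if versions:
            # Close the previous version at the new change's sequence value.
            versions[-1]["__end_at"] = row[sequence_by]
        versions.append({**row, "__start_at": row[sequence_by], "__end_at": None})
    return history

changes = [
    {"customer_id": 1, "address": "Old St", "order_date": datetime(2019, 3, 15)},
    {"customer_id": 1, "address": "New St", "order_date": datetime(2021, 7, 1)},
]

hist = apply_scd2(changes, key="customer_id", sequence_by="order_date")
print(hist[1][0]["__start_at"])  # 2019-03-15 00:00:00 (business event time)
print(hist[1][1]["__end_at"])    # None (current version)
```

Swap order_date for a processing timestamp and the first version would start "now" instead, which is exactly the symptom described above.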


THE FIX: USE A MEANINGFUL SEQUENCE BY COLUMN

The most straightforward solution is to use a column that represents the actual chronological ordering of changes. In your case, since you have raw order data from 2019, you likely have an order_date or a similar timestamp column. Use that as your SEQUENCE BY:

CREATE FLOW customer_cdc AS AUTO CDC INTO dim_customer
FROM STREAM(raw_orders_cdc)
KEYS (customer_id)
SEQUENCE BY order_date
COLUMNS * EXCEPT (operation)
STORED AS SCD TYPE 2;

Or in Python:

from pyspark import pipelines as dp
from pyspark.sql.functions import col

dp.create_streaming_table("dim_customer")
dp.create_auto_cdc_flow(
    target="dim_customer",
    source="raw_orders_cdc",
    keys=["customer_id"],
    sequence_by=col("order_date"),
    stored_as_scd_type="2"
)

With this setup, when the pipeline processes a record with order_date = 2019-03-15, the __START_AT will be 2019-03-15, which is exactly what you want.


RECOMMENDED BACKFILL STRATEGY: ROUTE EVERYTHING THROUGH THE AUTO CDC FLOW

The recommended approach is to structure your source so that it contains both historical and ongoing changes, then let the AUTO CDC flow process everything together:

1. Prepare your source data so that every record (historical from 2019 onward, plus any ongoing changes) has a proper sequencing timestamp reflecting when the event actually happened.

2. If you need to derive the effective date from your raw orders, create a view that computes it. For example, if you want to track the first order per customer:

CREATE OR REPLACE VIEW customer_changes AS
SELECT
  customer_id,
  customer_name,
  customer_address,
  order_date AS effective_date,
  'UPSERT' AS operation
FROM raw_orders;

3. Use this view as the source for your AUTO CDC flow with SEQUENCE BY effective_date.

4. Run a full refresh to reprocess everything from scratch:

REFRESH STREAMING TABLE dim_customer FULL

You can also trigger a full refresh from the pipeline UI by clicking the dropdown next to Start and selecting Full Refresh All.

The AUTO CDC framework automatically handles out-of-order data. Even if historical records with older timestamps arrive after newer records, the framework reorders them based on the SEQUENCE BY column and builds the correct SCD-2 history.
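The reordering behavior can be sketched in plain Python (the build_scd2 helper and record shapes are illustrative, not the actual framework): even when a 2019 record arrives after a 2023 record, ordering on the SEQUENCE BY column produces the correct timeline.

```python
from datetime import date

# Records arrive out of order: the historical 2019 row shows up last.
arriving = [
    {"customer_id": 1, "address": "Pine Rd", "effective_date": date(2023, 6, 1)},
    {"customer_id": 1, "address": "Oak Ave", "effective_date": date(2019, 3, 15)},
]

def build_scd2(rows, sequence_by="effective_date"):
    """Order by the SEQUENCE BY column first, then chain versions."""
    versions = []
    for row in sorted(rows, key=lambda r: r[sequence_by]):
        if versions:
            versions[-1]["__end_at"] = row[sequence_by]
        versions.append({**row, "__start_at": row[sequence_by], "__end_at": None})
    return versions

hist = build_scd2(arriving)
print(hist[0]["__start_at"], hist[0]["__end_at"])  # 2019-03-15 2023-06-01
print(hist[1]["__start_at"], hist[1]["__end_at"])  # 2023-06-01 None
```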


IMPORTANT CONSTRAINT: FLOW TYPE MIXING

One critical point -- a streaming table that is the target of an AUTO CDC flow can ONLY be targeted by other AUTO CDC flows. You cannot mix append flows (like INSERT INTO or INSERT INTO ONCE) with AUTO CDC flows on the same target table. This is documented here: https://docs.databricks.com/en/delta-live-tables/flows.html

This means the approach of creating a separate backfill flow using INSERT INTO ONCE on your dim_customer table will not work if dim_customer is already an AUTO CDC target. Instead, route all your data (historical and ongoing) through the AUTO CDC mechanism itself.


ALTERNATIVE: AUTO CDC FROM SNAPSHOT

If your historical data comes as point-in-time snapshots (common when migrating from source databases), you can use AUTO CDC FROM SNAPSHOT. This API compares successive snapshots to determine what changed and can output SCD Type 2 history. This is Python-only:

dp.create_auto_cdc_from_snapshot_flow(
    target="dim_customer",
    source="customer_snapshot_source",
    keys=["customer_id"],
    stored_as_scd_type=2
)

More details: https://docs.databricks.com/en/delta-live-tables/cdc.html
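A rough plain-Python picture of what snapshot comparison does (the diff_snapshots helper is a simplified stand-in, not the actual API): each pair of successive snapshots is diffed per key to emit inserts, updates, and deletes, which then drive the SCD2 history.

```python
def diff_snapshots(old, new):
    """Compare two {key: row} snapshots and emit change events,
    roughly as a snapshot-based CDC mechanism would."""
    changes = []
    for key, row in new.items():
        if key not in old:
            changes.append(("INSERT", key, row))
        elif old[key] != row:
            changes.append(("UPDATE", key, row))
    for key in old:
        if key not in new:
            changes.append(("DELETE", key, old[key]))
    return changes

snap_2019 = {1: {"name": "Ann", "city": "Oslo"}}
snap_2020 = {1: {"name": "Ann", "city": "Bergen"}, 2: {"name": "Bo", "city": "Oslo"}}

print(diff_snapshots(snap_2019, snap_2020))
# [('UPDATE', 1, {'name': 'Ann', 'city': 'Bergen'}), ('INSERT', 2, {'name': 'Bo', 'city': 'Oslo'})]
```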


TIPS FOR SEQUENCE BY

- The SEQUENCE BY column values must never be NULL.
- There should be one distinct update per key at each sequence value. If you have duplicates, add a tiebreaker using a STRUCT: SEQUENCE BY struct(order_date, order_id).
- For historical backfills, make sure your SEQUENCE BY values for historical records are earlier than your ongoing CDC events so the timeline is consistent.
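The struct tiebreaker behaves like a composite sort key; in plain Python (illustrative records, not the actual framework) a tuple plays the same role:

```python
from datetime import date

# Two changes for the same key on the same day: order_date alone
# cannot order them, so order_id breaks the tie.
changes = [
    {"order_id": 102, "order_date": date(2019, 3, 15), "status": "shipped"},
    {"order_id": 101, "order_date": date(2019, 3, 15), "status": "placed"},
]

# SEQUENCE BY struct(order_date, order_id) ~ sort key (order_date, order_id)
ordered = sorted(changes, key=lambda r: (r["order_date"], r["order_id"]))
print([r["status"] for r in ordered])  # ['placed', 'shipped']
```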


DOCUMENTATION REFERENCES

- AUTO CDC (APPLY CHANGES) for SCD Type 2: https://docs.databricks.com/en/delta-live-tables/cdc.html
- Flows (including backfill patterns and restrictions): https://docs.databricks.com/en/delta-live-tables/flows.html
- Pipeline updates and full refresh: https://docs.databricks.com/en/delta-live-tables/updates.html
- Lakeflow Spark Declarative Pipelines (SDP) overview: https://docs.databricks.com/en/delta-live-tables/index.html

Hope this helps you get your dim_customer SCD-2 backfill working. The key takeaway is: use a SEQUENCE BY column that reflects the actual business event time (like order_date), route all your data through the AUTO CDC flow, and do a full refresh to reprocess the complete history. The framework will build the correct SCD-2 timeline automatically.

* This reply used an agent system I built to research and draft this response based on the wide set of documentation I have available and previous memory. I personally review the draft for any obvious issues and for monitoring system reliability and update it when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand new features.