Hi @JD18,
Welcome to Databricks, and thanks for raising this. SCD Type 2 backfilling with streaming tables is a common need, and the good news is that the AUTO CDC framework (formerly APPLY CHANGES INTO) has built-in capabilities to handle this -- you just need to set up your SEQUENCE BY column correctly.
THE ROOT CAUSE OF YOUR ISSUE
The reason your __START_AT values are showing 2026 (current time) instead of the historical dates from 2019 is almost certainly related to what you are using as your SEQUENCE BY column. When you define an AUTO CDC flow with STORED AS SCD TYPE 2, the framework automatically populates __START_AT and __END_AT using the values from your SEQUENCE BY column -- not current_timestamp. So if your SEQUENCE BY column contains the current processing time rather than the original business event time, that is exactly what __START_AT will reflect.
For example, if your flow looks like this:
CREATE FLOW customer_cdc AS AUTO CDC INTO dim_customer
FROM STREAM(raw_orders_cdc)
KEYS (customer_id)
SEQUENCE BY processing_timestamp
STORED AS SCD TYPE 2;
If processing_timestamp is set to when the pipeline runs, then __START_AT will be 2026. Instead, you need to sequence by a column that reflects when the business event actually occurred.
THE FIX: USE A MEANINGFUL SEQUENCE BY COLUMN
The most straightforward solution is to use a column that represents the actual chronological ordering of changes. In your case, since you have raw order data from 2019, you likely have an order_date or a similar timestamp column. Use that as your SEQUENCE BY:
CREATE FLOW customer_cdc AS AUTO CDC INTO dim_customer
FROM STREAM(raw_orders_cdc)
KEYS (customer_id)
SEQUENCE BY order_date
COLUMNS * EXCEPT (operation)
STORED AS SCD TYPE 2;
Or in Python:
from pyspark import pipelines as dp
from pyspark.sql.functions import col
dp.create_streaming_table("dim_customer")
dp.create_auto_cdc_flow(
target="dim_customer",
source="raw_orders_cdc",
keys=["customer_id"],
sequence_by=col("order_date"),
stored_as_scd_type="2"
)
With this setup, when the pipeline processes a record with order_date = 2019-03-15, the __START_AT will be 2019-03-15, which is exactly what you want.
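To make that concrete, here is a small pure-Python sketch of how the __START_AT/__END_AT windows fall out of the SEQUENCE BY values. This is NOT the real AUTO CDC implementation -- the function and sample rows are illustrative only -- but it captures the key point: each version starts at its own sequence value and is closed by the next version's sequence value, with no current_timestamp involved.

```python
from datetime import date

def build_scd2(changes, key="customer_id", sequence_by="order_date"):
    """Illustrative SCD Type 2 build: order changes by the sequence column,
    then derive each version's validity window from those values."""
    by_key = {}
    for row in sorted(changes, key=lambda r: r[sequence_by]):
        by_key.setdefault(row[key], []).append(row)
    history = []
    for versions in by_key.values():
        for i, row in enumerate(versions):
            history.append({
                **row,
                "__START_AT": row[sequence_by],  # taken from SEQUENCE BY
                "__END_AT": (versions[i + 1][sequence_by]  # closed by next version
                             if i + 1 < len(versions) else None),
            })
    return history

changes = [
    {"customer_id": 1, "customer_address": "Old St", "order_date": date(2019, 3, 15)},
    {"customer_id": 1, "customer_address": "New Ave", "order_date": date(2021, 7, 1)},
]
history = build_scd2(changes)
```

Here the 2019 record gets __START_AT = 2019-03-15 and is closed at 2021-07-01, while the 2021 record stays open (__END_AT is NULL) as the current version.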
RECOMMENDED BACKFILL STRATEGY: ROUTE EVERYTHING THROUGH THE AUTO CDC FLOW
The recommended approach is to structure your source so that it contains both historical and ongoing changes, then let the AUTO CDC flow process everything together:
1. Prepare your source data so that every record (historical from 2019 onward, plus any ongoing changes) has a proper sequencing timestamp reflecting when the event actually happened.
2. If you need to derive the effective date from your raw orders, create a view that computes it. For example, if you want to track the first order per customer:
CREATE OR REPLACE VIEW customer_changes AS
SELECT
customer_id,
customer_name,
customer_address,
order_date AS effective_date,
'UPSERT' AS operation
FROM raw_orders;
3. Use this view as the source for your AUTO CDC flow with SEQUENCE BY effective_date.
4. Run a full refresh to reprocess everything from scratch:
REFRESH STREAMING TABLE dim_customer FULL
You can also trigger a full refresh from the pipeline UI by clicking the dropdown next to Start and selecting Full Refresh All.
The AUTO CDC framework automatically handles out-of-order data. Even if historical records with older timestamps arrive after newer records, the framework reorders them based on the SEQUENCE BY column and builds the correct SCD-2 history.
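The arrival-order independence can be illustrated with a small hedged sketch (again, not the framework's actual code; the rows and column names are hypothetical): even when a 2019 backfill record lands after a newer change has already been seen, ordering by the SEQUENCE BY column yields the same timeline.

```python
from datetime import date

# Hypothetical arrival order: the 2019 backfill record arrives AFTER a
# newer 2024 change has already been processed.
arrivals = [
    {"customer_id": 1, "customer_address": "New Ave", "order_date": date(2024, 1, 5)},
    {"customer_id": 1, "customer_address": "Old St", "order_date": date(2019, 3, 15)},
]

# Conceptually, the framework orders records by the SEQUENCE BY column,
# so the rebuilt timeline is independent of arrival order.
timeline = sorted(arrivals, key=lambda r: r["order_date"])
versions = [
    {**row,
     "__START_AT": row["order_date"],
     "__END_AT": (timeline[i + 1]["order_date"]
                  if i + 1 < len(timeline) else None)}
    for i, row in enumerate(timeline)
]
```

The late-arriving 2019 record becomes the first version and is correctly closed by the 2024 change, which remains the open, current row.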
IMPORTANT CONSTRAINT: FLOW TYPE MIXING
One critical point -- a streaming table that is the target of an AUTO CDC flow can ONLY be targeted by other AUTO CDC flows. You cannot mix append flows (like INSERT INTO or INSERT INTO ONCE) with AUTO CDC flows on the same target table. This is documented here: https://docs.databricks.com/en/delta-live-tables/flows.html
This means the approach of creating a separate backfill flow using INSERT INTO ONCE on your dim_customer table will not work if dim_customer is already an AUTO CDC target. Instead, route all your data (historical and ongoing) through the AUTO CDC mechanism itself.
ALTERNATIVE: AUTO CDC FROM SNAPSHOT
If your historical data comes as point-in-time snapshots (common when migrating from source databases), you can use AUTO CDC FROM SNAPSHOT. This API compares successive snapshots to determine what changed and can output SCD Type 2 history. This is Python-only:
dp.create_auto_cdc_from_snapshot_flow(
target="dim_customer",
source="customer_snapshot_source",
keys=["customer_id"],
stored_as_scd_type=2
)
More details: https://docs.databricks.com/en/delta-live-tables/cdc.html
TIPS FOR SEQUENCE BY
- The SEQUENCE BY column values must never be NULL.
- There should be one distinct update per key at each sequence value. If you have duplicates, add a tiebreaker using a STRUCT: SEQUENCE BY struct(order_date, order_id).
- For historical backfills, make sure your SEQUENCE BY values for historical records are earlier than your ongoing CDC events so the timeline is consistent.
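The struct tiebreaker in the second tip maps onto ordinary tuple ordering. A quick sketch with hypothetical rows: SEQUENCE BY struct(order_date, order_id) behaves like sorting by the tuple (order_date, order_id), so order_id breaks ties deterministically when two changes share an order_date.

```python
from datetime import date

# Hypothetical duplicate: two changes for the same customer share an
# order_date, which would be ambiguous under SEQUENCE BY order_date alone.
changes = [
    {"customer_id": 1, "order_id": 102, "customer_address": "Later",
     "order_date": date(2019, 3, 15)},
    {"customer_id": 1, "order_id": 101, "customer_address": "Earlier",
     "order_date": date(2019, 3, 15)},
]

# struct(order_date, order_id) orders like the tuple (order_date, order_id):
# equal dates fall back to order_id, resolving the tie deterministically.
ordered = sorted(changes, key=lambda r: (r["order_date"], r["order_id"]))
```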
DOCUMENTATION REFERENCES
- AUTO CDC (APPLY CHANGES) for SCD Type 2: https://docs.databricks.com/en/delta-live-tables/cdc.html
- Flows (including backfill patterns and restrictions): https://docs.databricks.com/en/delta-live-tables/flows.html
- Pipeline updates and full refresh: https://docs.databricks.com/en/delta-live-tables/updates.html
- Lakeflow Spark Declarative Pipelines (SDP) overview: https://docs.databricks.com/en/delta-live-tables/index.html
Hope this helps you get your dim_customer SCD-2 backfill working. The key takeaway is: use a SEQUENCE BY column that reflects the actual business event time (like order_date), route all your data through the AUTO CDC flow, and do a full refresh to reprocess the complete history. The framework will build the correct SCD-2 timeline automatically.
* This reply was drafted with an agent system I built, which researches responses using the documentation I have available and previous memory. I personally review each draft for obvious issues, monitor the system's reliability, and update the reply when I detect drift, but there is still a small chance something is inaccurate, especially if you are experimenting with brand new features.