Every data warehouse has the same foundational problem: keeping dimension tables in sync with operational systems. Customer records change. Orders get updated. Accounts get closed. Getting those changes into your warehouse -- correctly, reliably, and in the right order -- is the most critical pipeline pattern in any data platform. And yet, the tool most teams reach for -- MERGE INTO -- was never designed for this job. It does not understand event ordering, deduplication, or delete propagation. Engineers build all of that themselves, and what starts as "just merge the new records" spirals into 150+ lines of procedural SQL. Every team reimplements this pattern. Some get it subtly wrong. The failures are silent -- you find out weeks later when a dashboard is wrong or an audit fails.
AUTO CDC is a new declarative SQL API in Databricks SQL that handles all of this for you. You declare your keys, your ordering column, and your SCD type -- and AUTO CDC handles dedup, ordering, deletes, checkpointing, and history tracking automatically. What used to take 150+ lines of fragile MERGE logic becomes about 7 lines of SQL. AUTO CDC is available now in Beta -- you can try it today in the Databricks SQL Query Editor, running directly on your SQL warehouse.

In this post, we will walk through the pain points that every warehouse engineer knows too well, show how AUTO CDC eliminates them, and build a working SCD Type 2 pipeline from scratch.
The Challenge: MERGE INTO Creates Technical Debt
If you have built Change Data Capture (CDC) pipelines with MERGE INTO, these scenarios will feel familiar:
- Multiple updates to the same key in one batch -- Your source sends two updates for customer_id = 42 in the same load. Which one wins? Without explicit dedup logic, the answer is "whichever the query optimizer picks" -- and that is not deterministic.
- Out-of-order records across batches -- A record with updated_at = 10:05 AM arrives in batch 3, but the record with updated_at = 10:02 AM arrives in batch 4. Your MERGE happily overwrites the newer data with the older data. Silently.
- Delete handling -- You process a delete for customer_id = 99, then a stale update for the same customer arrives in the next batch. Your MERGE re-inserts the record. The customer is back from the dead.
- SCD Type 2 on top of all that -- Now layer in historical tracking. You need checksums to detect real changes, CTEs to stage the merge, LAG/LEAD window functions to compute effective dates, and a MERGE statement that handles matched-updates, matched-closes, and not-matched-inserts all at once.
- Error recovery -- Something fails mid-pipeline. The recovery plan? Truncate the target table and rerun from the beginning. Hope it finishes this time.
Here is a simplified glimpse of what that procedural approach looks like, and this is the short version!
Procedural SCD Type 2, the "short" version (~30 lines; real-world implementations run 150+):
```sql
-- Step 1: Stage incoming data, dedup by key, keep the latest record per key
CREATE OR REPLACE TEMP VIEW staged AS
SELECT * EXCEPT (rn) FROM (
  SELECT *, ROW_NUMBER() OVER (
    PARTITION BY customer_id ORDER BY updated_at DESC
  ) AS rn
  FROM customer_updates
) WHERE rn = 1;

-- Step 2: Compare checksums to keep only rows that actually changed
CREATE OR REPLACE TEMP VIEW changes AS
SELECT s.*
FROM staged s
JOIN customer_dim d
  ON s.customer_id = d.customer_id AND d.is_current = true
WHERE md5(concat_ws('|', d.name, d.email, d.city))
   != md5(concat_ws('|', s.name, s.email, s.city));

-- Step 3: Close existing records + insert new versions.
-- Each change produces two source rows: one whose merge_key matches and
-- closes the current version, and one with a NULL merge_key (which never
-- matches) that inserts the new version.
MERGE INTO customer_dim AS target
USING (
  SELECT customer_id AS merge_key, * FROM changes
  UNION ALL
  SELECT NULL AS merge_key, * FROM changes
) AS source
ON target.customer_id = source.merge_key AND target.is_current = true
WHEN MATCHED THEN UPDATE SET
  is_current = false, effective_to = source.updated_at
WHEN NOT MATCHED THEN INSERT
  (customer_id, name, email, city, updated_at, is_current, effective_from, effective_to)
VALUES
  (source.customer_id, source.name, source.email, source.city, source.updated_at,
   true, source.updated_at, NULL);
```
And this still does not handle out-of-order records, deletes, or error recovery. Every one of those requires more code, more edge cases, and more testing.
How AUTO CDC Works
Instead of writing procedural logic to handle every edge case, you declare what you want and AUTO CDC figures out how:
- Out-of-order records -- SEQUENCE BY ensures records are processed in the correct order, regardless of when they arrive. Late-arriving records are slotted into the right position automatically.
- Delete propagation -- APPLY AS DELETE WHEN lets you declaratively specify which records represent deletes. No more ghost records.
- SCD Type 1 & Type 2 -- A single keyword: STORED AS SCD TYPE 1 keeps only the latest version. STORED AS SCD TYPE 2 maintains full history with __START_AT and __END_AT columns -- no checksums, no window functions, no CTEs.
- Partial updates -- IGNORE NULL UPDATES prevents null values in sparse CDC records from overwriting existing data -- a common headache with Debezium and other CDC connectors that send partial rows (see the sketch after this list).
- Selective history tracking -- TRACK HISTORY ON lets you control which columns trigger a new SCD Type 2 history row. Changes to untracked columns (e.g., last_login_at) update the current row in place instead of creating unnecessary version bloat.
- Error recovery -- Built-in checkpointing means the pipeline picks up exactly where it left off. No truncate-and-rerun.
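To make these clauses concrete, here is a hedged sketch of how delete propagation and sparse-update handling might compose in a single statement. The source table and its `op` column are hypothetical stand-ins for a Debezium-style feed, and the exact position of `IGNORE NULL UPDATES` within the statement is our assumption, modeled on the full examples later in this post:

```sql
CREATE OR REFRESH STREAMING TABLE orders_current
FLOW AUTO CDC
FROM STREAM cdc.orders_feed        -- hypothetical Debezium-style CDC source
KEYS (order_id)
IGNORE NULL UPDATES                -- assumed placement: sparse rows no longer null out columns
APPLY AS DELETE WHEN op = 'd'      -- 'op' = 'd' follows Debezium's delete convention; adapt to your feed
SEQUENCE BY event_ts
STORED AS SCD TYPE 1;
```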
Here is how those pain points map across approaches:
| Pain Point | Procedural MERGE | AUTO CDC |
|---|---|---|
| Dedup / multiple updates per key | ROW_NUMBER() + manual dedup CTE | Handled automatically |
| Out-of-order records | Complex CASE logic or ignored entirely | SEQUENCE BY |
| Delete handling | Manual MERGE with delete flags | APPLY AS DELETE WHEN |
| SCD Type 2 history | Checksums + LAG/LEAD + multi-arm MERGE | STORED AS SCD TYPE 2 |
| Error recovery | Truncate and rerun | Automatic checkpointing |
| Partial / sparse updates | COALESCE() on every column | IGNORE NULL UPDATES |
| Selective history tracking | Custom checksums on selected columns | TRACK HISTORY ON |
| Lines of code | 150+ | ~7 |
AUTO CDC Syntax at a Glance
The syntax is built around a few intuitive building blocks:
- `KEYS` -- The business key(s) that uniquely identify a record
- `SEQUENCE BY` -- The column that determines record order (timestamp, sequence number, etc.)
- `STORED AS SCD TYPE 1|2` -- Whether to keep only the latest version or full history (default: SCD Type 1)
- `APPLY AS DELETE WHEN` -- A condition that marks records as deletes
SCD Type 1 is the default, and that is a deliberate design choice. The most common CDC problem is deceptively simple: keep a "golden record" table that always reflects the latest state of each entity. Think of a customer dimension that powers your dashboards and reports -- you do not need every historical version, you just need the current truth. But as we saw earlier, even this "simple" case gets complicated fast with MERGE INTO when you factor in duplicate keys, out-of-order arrivals, and delete handling. AUTO CDC makes the common case effortless.
Here is what SCD Type 1 looks like -- keep only the latest version of each booking:
```sql
CREATE OR REFRESH STREAMING TABLE bookings_current
FLOW AUTO CDC
FROM STREAM samples.wanderbricks.booking_updates
KEYS (booking_id)
SEQUENCE BY updated_at
STORED AS SCD TYPE 1;
```
Six lines. No dedup logic. No out-of-order handling. No error recovery code. It just works.
But sometimes the current state is not enough. Regulatory audits need to know what a customer's address was on a specific date. Finance teams need to reconstruct metrics as they existed at quarter-close. Fraud investigations need to trace every change to an account. That is when you upgrade from the default to SCD Type 2 -- and the only change is one line:
```sql
CREATE OR REFRESH STREAMING TABLE bookings_history
FLOW AUTO CDC
FROM STREAM samples.wanderbricks.booking_updates
KEYS (booking_id)
SEQUENCE BY updated_at
STORED AS SCD TYPE 2;
```
The only difference from the default SCD Type 1 is the last line. AUTO CDC automatically adds __START_AT and __END_AT columns to track when each version of a record was active. The current version has __END_AT = NULL. You start with Type 1 as the default, and when the business need arises, upgrading to full history is a one-line change -- not a rewrite.
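Because the current version of each record is simply the row with __END_AT = NULL, reading the latest state back out of a Type 2 table is a one-line filter:

```sql
-- Current state only: the open version of each booking has no end timestamp
SELECT * FROM bookings_history
WHERE __END_AT IS NULL;
```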
Want to avoid version bloat? Use TRACK HISTORY ON to control which columns trigger new history rows. For example, if changes to city should not create a new version:
```sql
CREATE OR REFRESH STREAMING TABLE bookings_history_selective
FLOW AUTO CDC
FROM STREAM samples.wanderbricks.booking_updates
KEYS (booking_id)
SEQUENCE BY updated_at
STORED AS SCD TYPE 2
TRACK HISTORY ON * EXCEPT (city);
```
With this configuration, changes to city update the current row in place, while changes to any other column create a new historical version. This keeps your history table focused on the changes that actually matter to your business.
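One way to sanity-check that behavior, sketched against the table defined above: count versions per booking before and after a city-only change. The count should stay flat, while a change to any tracked column adds a row:

```sql
-- Version count per booking: a city-only change should not increase this
SELECT booking_id, COUNT(*) AS versions
FROM bookings_history_selective
GROUP BY booking_id
ORDER BY booking_id;
```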
Building an SCD Type 2 Model in Your Data Warehouse
Let's build a complete example from scratch using the Databricks SQL Query Editor connected to a SQL warehouse. Every step below runs directly in the Query Editor -- no notebooks, no pipeline UI, no Python. We will create a customer table with Change Data Feed (CDF) enabled, simulate real-world changes -- inserts, updates, and deletes -- and watch AUTO CDC handle all of it declaratively.
Step 1: Create a source table with Change Data Feed enabled
Change Data Feed captures row-level changes on a Delta table. This is our CDC source.
```sql
CREATE OR REPLACE TABLE mycatalog.default.customers (
  customer_id INT,
  first_name STRING,
  last_name STRING,
  email STRING,
  city STRING
)
TBLPROPERTIES (delta.enableChangeDataFeed = true);
```
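To confirm the property took effect before wiring up the pipeline, you can list the table properties:

```sql
-- Should report delta.enableChangeDataFeed = true
SHOW TBLPROPERTIES mycatalog.default.customers;
```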
Step 2: Insert initial records
Load 10 customers as our starting dataset:
```sql
INSERT INTO mycatalog.default.customers VALUES
  (1, 'Alice', 'Johnson', 'alice.johnson@example.com', 'Seattle'),
  (2, 'Bob', 'Smith', 'bob.smith@example.com', 'Portland'),
  (3, 'Carol', 'Williams', 'carol.williams@example.com', 'Denver'),
  (4, 'David', 'Brown', 'david.brown@example.com', 'Austin'),
  (5, 'Eve', 'Davis', 'eve.davis@example.com', 'Chicago'),
  (6, 'Frank', 'Miller', 'frank.miller@example.com', 'Boston'),
  (7, 'Grace', 'Wilson', 'grace.wilson@example.com', 'Miami'),
  (8, 'Hank', 'Moore', 'hank.moore@example.com', 'Atlanta'),
  (9, 'Ivy', 'Taylor', 'ivy.taylor@example.com', 'Dallas'),
  (10, 'Jack', 'Anderson', 'jack.anderson@example.com', 'Phoenix');
```
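As an optional check, you can peek at what the Change Data Feed has captured using the table_changes function. On a freshly created table, the CREATE is commit version 0 and this insert is version 1, so reading from version 1 should return all ten rows with _change_type = 'insert':

```sql
-- Row-level change events captured by CDF, from commit version 1 onward
SELECT * FROM table_changes('mycatalog.default.customers', 1);
```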
Step 3: Create the SCD Type 2 streaming table with AUTO CDC
This is the core of the entire pipeline -- and it is just one statement:
```sql
CREATE OR REFRESH STREAMING TABLE mycatalog.default.customers_scd2
SCHEDULE REFRESH EVERY 1 DAY
FLOW AUTO CDC
FROM STREAM mycatalog.default.customers
  WITH (readChangeFeed = true)
KEYS (customer_id)
APPLY AS DELETE WHEN _change_type = 'delete'
SEQUENCE BY _commit_timestamp
COLUMNS * EXCEPT (_change_type, _commit_version, _commit_timestamp)
STORED AS SCD TYPE 2;
```
Let's break down what each line does:
- `FROM STREAM ... WITH (readChangeFeed = true)` -- Reads the Change Data Feed from the source table, which provides row-level insert, update, and delete events.
- `KEYS (customer_id)` -- Identifies each customer uniquely.
- `APPLY AS DELETE WHEN _change_type = 'delete'` -- When CDF reports a delete, AUTO CDC closes the record's history (sets __END_AT) rather than leaving a ghost row.
- `SEQUENCE BY _commit_timestamp` -- Uses the CDF commit timestamp to order changes correctly -- even if they arrive out of sequence.
- `COLUMNS * EXCEPT (...)` -- Strips out the CDF metadata columns so the target table contains only business data.
- `STORED AS SCD TYPE 2` -- Maintains full history. Every change creates a new version with __START_AT and __END_AT tracking. If you only want to maintain the current version, simply swap this out for `STORED AS SCD TYPE 1`.
Step 4: Simulate real-world changes
Now let's simulate what happens in a production system -- new customers sign up, existing customers update their information, and some customers are removed.
Insert 2 new customers:
```sql
INSERT INTO mycatalog.default.customers VALUES
  (11, 'Karen', 'Thomas', 'karen.thomas@example.com', 'San Francisco'),
  (12, 'Leo', 'Jackson', 'leo.jackson@example.com', 'New York');
```
Update 2 existing customers -- Alice moved to Los Angeles, Bob changed his email:
```sql
UPDATE mycatalog.default.customers SET city = 'Los Angeles' WHERE customer_id = 1;
UPDATE mycatalog.default.customers SET email = 'bob.updated@example.com' WHERE customer_id = 2;
```
Delete 2 customers:
```sql
DELETE FROM mycatalog.default.customers WHERE customer_id = 9;
DELETE FROM mycatalog.default.customers WHERE customer_id = 10;
```
Step 5: Refresh and query the results
Refresh the streaming table to process all the changes:
```sql
REFRESH STREAMING TABLE mycatalog.default.customers_scd2;
```
Now query the SCD Type 2 table:
```sql
SELECT * FROM mycatalog.default.customers_scd2;
```

Here is what you will see in the results:
- Customers 3-8 -- Unchanged. One row each with __END_AT = NULL (current and active).
- Customer 1 (Alice) -- Two rows. The original Seattle version has a __END_AT timestamp marking when it was superseded. The new Los Angeles version has __END_AT = NULL -- she is current.
- Customer 2 (Bob) -- Two rows. Same pattern -- the old email version is closed, the new email version is current.
- Customers 9 and 10 (Ivy and Jack) -- Their records have __END_AT set to the deletion timestamp. They are closed -- not deleted from the table, but historically tracked as removed.
- Customers 11 and 12 (Karen and Leo) -- One row each with __END_AT = NULL. New additions, fully tracked from their first appearance.
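To inspect a single key's timeline, filter on the business key and order by __START_AT. Alice's two versions come back in order -- Seattle (closed) followed by Los Angeles (current):

```sql
SELECT customer_id, city, __START_AT, __END_AT
FROM mycatalog.default.customers_scd2
WHERE customer_id = 1
ORDER BY __START_AT;
```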
No dedup logic. No checksums. No window functions. No multi-arm MERGE. AUTO CDC handled all of it -- inserts, updates, deletes, and history tracking -- from a single declarative statement.
When to Use AUTO CDC
AUTO CDC is a natural fit any time you are building dimensional tables from change data:
- CDC feeds from operational databases -- Whether you are consuming Change Data Feed from Delta tables, Debezium streams, or any source that provides row-level changes, AUTO CDC handles the complexity of turning those changes into clean dimensions.
- Replacing existing MERGE-based SCD pipelines -- If you have existing pipelines with the 150-line MERGE pattern, AUTO CDC is a straightforward modernization path. Same result, a fraction of the code, and better error recovery.
- Out-of-order or late-arriving data -- If your source data does not arrive in perfect chronological order -- and it rarely does in production -- AUTO CDC's SEQUENCE BY handles this automatically instead of silently corrupting your dimensions.
Note: What about sources without a change feed? Many legacy databases and JDBC sources only provide periodic snapshots, not row-level change events. AUTO CDC FROM SNAPSHOT -- which automatically infers inserts, updates, and deletes by comparing successive snapshots -- is available today in Spark Declarative Pipelines via Python. SQL support is not yet available but is planned for a future release.
Requirements
- Compute: Serverless SQL Warehouse, or SDP Pro / Advanced edition
- Runtime: Databricks Runtime 17.3 or later
Get Started
AUTO CDC is available today in Beta. Here is how to get started:
Try it with your own data -- swap out the `mycatalog.default.customers` table for one of your CDC sources, and see how much procedural code you can retire. Take an existing MERGE-based SCD pipeline, rewrite it with AUTO CDC, and compare the results. We think you will be surprised by how much complexity disappears.
If you have feedback or questions, drop them in the comments below.