Every data warehouse has the same foundational problem: keeping dimension tables in sync with operational systems. Customer records change. Orders get updated. Accounts get closed. Getting those changes into your warehouse -- correctly, reliably, and in the right order -- is the most critical pipeline pattern in any data platform. And yet, the tool most teams reach for -- MERGE INTO -- was never designed for this job. It does not understand event ordering, deduplication, or delete propagation. Engineers build all of that themselves, and what starts as "just merge the new records" spirals into 150+ lines of procedural SQL. Every team reimplements this pattern. Some get it subtly wrong. The failures are silent -- you find out weeks later when a dashboard is wrong or an audit fails.
AUTO CDC is a new declarative SQL API in Databricks SQL that handles all of this for you. You declare your keys, your ordering column, and your SCD type -- and AUTO CDC handles dedup, ordering, deletes, checkpointing, and history tracking automatically. What used to take 150+ lines of fragile MERGE logic becomes about 7 lines of SQL. AUTO CDC is available now in Beta -- you can try it today in the Databricks SQL Query Editor, running directly on your SQL warehouse.

In this post, we will walk through the pain points that every warehouse engineer knows too well, show how AUTO CDC eliminates them, and build a working SCD Type 2 pipeline from scratch.
The Challenge: MERGE INTO Creates Technical Debt
If you have built Change Data Capture (CDC) pipelines with MERGE INTO, these scenarios will feel familiar:
- Multiple updates to the same key in one batch -- Your source sends two updates for customer_id = 42 in the same load. Which one wins? Without explicit dedup logic, the answer is "whichever the query optimizer picks" -- and that is not deterministic.
- Out-of-order records across batches -- A record with updated_at = 10:05 AM arrives in batch 3, but the record with updated_at = 10:02 AM arrives in batch 4. Your MERGE happily overwrites the newer data with the older data. Silently.
- Delete handling -- You process a delete for customer_id = 99, then a stale update for the same customer arrives in the next batch. Your MERGE re-inserts the record. The customer is back from the dead.
- SCD Type 2 on top of all that -- Now layer in historical tracking. You need checksums to detect real changes, CTEs to stage the merge, LAG/LEAD window functions to compute effective dates, and a MERGE statement that handles matched-updates, matched-closes, and not-matched-inserts all at once.
- Error recovery -- Something fails mid-pipeline. The recovery plan? Truncate the target table and rerun from the beginning. Hope it finishes this time.
Here is a simplified glimpse of what that procedural approach looks like, and this is the short version!
Procedural SCD Type 2, the "short" version (~30 lines; real-world implementations run 150+):
```sql
-- Step 1: Stage incoming data, dedup by key, keep the latest record per key
CREATE OR REPLACE TEMP VIEW staged AS
SELECT * EXCEPT (rn) FROM (
  SELECT *, ROW_NUMBER() OVER (
    PARTITION BY customer_id ORDER BY updated_at DESC
  ) AS rn
  FROM customer_updates
) WHERE rn = 1;

-- Step 2: Compare checksums to keep only rows that actually changed
CREATE OR REPLACE TEMP VIEW changes AS
SELECT s.*
FROM staged s
JOIN customer_dim d
  ON s.customer_id = d.customer_id AND d.is_current = true
WHERE md5(concat_ws('|', d.name, d.email, d.city))
   != md5(concat_ws('|', s.name, s.email, s.city));

-- Step 3: Close existing records + insert new versions.
-- Each change produces two source rows: one whose merge_key matches and
-- closes the current version, and one with a NULL merge_key (which never
-- matches) that inserts the new version.
MERGE INTO customer_dim AS target
USING (
  SELECT customer_id AS merge_key, * FROM changes
  UNION ALL
  SELECT NULL AS merge_key, * FROM changes
) AS source
ON target.customer_id = source.merge_key AND target.is_current = true
WHEN MATCHED THEN UPDATE SET
  is_current = false, effective_to = source.updated_at
WHEN NOT MATCHED THEN INSERT
  (customer_id, name, email, city, updated_at, is_current, effective_from, effective_to)
VALUES
  (source.customer_id, source.name, source.email, source.city, source.updated_at,
   true, source.updated_at, NULL);
```
And this still does not handle out-of-order records, deletes, or error recovery. Every one of those requires more code, more edge cases, and more testing.
How AUTO CDC Works
Instead of writing procedural logic to handle every edge case, you declare what you want and AUTO CDC figures out how:
- Out-of-order records -- SEQUENCE BY ensures records are processed in the correct order, regardless of when they arrive. Late-arriving records are slotted into the right position automatically.
- Delete propagation -- APPLY AS DELETE WHEN lets you declaratively specify which records represent deletes. No more ghost records.
- SCD Type 1 & Type 2 -- A single keyword: STORED AS SCD TYPE 1 keeps only the latest version. STORED AS SCD TYPE 2 maintains full history with __START_AT and __END_AT columns -- no checksums, no window functions, no CTEs.
- Partial updates -- IGNORE NULL UPDATES prevents null values in sparse CDC records from overwriting existing data -- a common headache with Debezium and other CDC connectors that send partial rows (see the sketch after this list).
- Selective history tracking -- TRACK HISTORY ON lets you control which columns trigger a new SCD Type 2 history row. Changes to untracked columns (e.g., last_login_at) update the current row in place instead of creating unnecessary version bloat.
- Error recovery -- Built-in checkpointing means the pipeline picks up exactly where it left off. No truncate-and-rerun.
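To make these clauses concrete, here is a hedged sketch of how delete propagation and sparse-update handling might compose in a single statement. The source table and its `op` column are hypothetical stand-ins for a Debezium-style feed, and the exact position of `IGNORE NULL UPDATES` within the statement is our assumption, modeled on the full examples later in this post:

```sql
CREATE OR REFRESH STREAMING TABLE orders_current
FLOW AUTO CDC
FROM STREAM cdc.orders_feed        -- hypothetical Debezium-style CDC source
KEYS (order_id)
IGNORE NULL UPDATES                -- assumed placement: sparse rows no longer null out columns
APPLY AS DELETE WHEN op = 'd'      -- 'op' = 'd' follows Debezium's delete convention; adapt to your feed
SEQUENCE BY event_ts
STORED AS SCD TYPE 1;
```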
Here is how those pain points map across approaches:
| Pain Point | Procedural MERGE | AUTO CDC |
|---|---|---|
| Dedup / multiple updates per key | ROW_NUMBER() + manual dedup CTE | Handled automatically |
| Out-of-order records | Complex CASE logic or ignored entirely | SEQUENCE BY |
| Delete handling | Manual MERGE with delete flags | APPLY AS DELETE WHEN |
| SCD Type 2 history | Checksums + LAG/LEAD + multi-arm MERGE | STORED AS SCD TYPE 2 |
| Error recovery | Truncate and rerun | Automatic checkpointing |
| Partial / sparse updates | COALESCE() on every column | IGNORE NULL UPDATES |
| Selective history tracking | Custom checksums on selected columns | TRACK HISTORY ON |
| Lines of code | 150+ | ~7 |
AUTO CDC Syntax at a Glance
The syntax is built around a few intuitive building blocks:
- `KEYS` -- The business key(s) that uniquely identify a record
- `SEQUENCE BY` -- The column that determines record order (timestamp, sequence number, etc.)
- `STORED AS SCD TYPE 1|2` -- Whether to keep only the latest version or full history (default: SCD Type 1)
- `APPLY AS DELETE WHEN` -- A condition that marks records as deletes
SCD Type 1 is the default, and that is a deliberate design choice. The most common CDC problem is deceptively simple: keep a "golden record" table that always reflects the latest state of each entity. Think of a customer dimension that powers your dashboards and reports -- you do not need every historical version, you just need the current truth. But as we saw earlier, even this "simple" case gets complicated fast with MERGE INTO when you factor in duplicate keys, out-of-order arrivals, and delete handling. AUTO CDC makes the common case effortless.
Here is what SCD Type 1 looks like -- keep only the latest version of each booking:
```sql
CREATE OR REFRESH STREAMING TABLE bookings_current
FLOW AUTO CDC
FROM STREAM samples.wanderbricks.booking_updates
KEYS (booking_id)
SEQUENCE BY updated_at
STORED AS SCD TYPE 1;
```
Six lines. No dedup logic. No out-of-order handling. No error recovery code. It just works.
But sometimes the current state is not enough. Regulatory audits need to know what a customer's address was on a specific date. Finance teams need to reconstruct metrics as they existed at quarter-close. Fraud investigations need to trace every change to an account. That is when you upgrade from the default to SCD Type 2 -- and the only change is one line:
```sql
CREATE OR REFRESH STREAMING TABLE bookings_history
FLOW AUTO CDC
FROM STREAM samples.wanderbricks.booking_updates
KEYS (booking_id)
SEQUENCE BY updated_at
STORED AS SCD TYPE 2;
```
The only difference from the default SCD Type 1 is the last line. AUTO CDC automatically adds __START_AT and __END_AT columns to track when each version of a record was active. The current version has __END_AT = NULL. You start with Type 1 as the default, and when the business need arises, upgrading to full history is a one-line change -- not a rewrite.
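Because the current version of each record is simply the row with __END_AT = NULL, reading the latest state back out of a Type 2 table is a one-line filter:

```sql
-- Current state only: the open version of each booking has no end timestamp
SELECT * FROM bookings_history
WHERE __END_AT IS NULL;
```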
Want to avoid version bloat? Use TRACK HISTORY ON to control which columns trigger new history rows. For example, if changes to city should not create a new version:
```sql
CREATE OR REFRESH STREAMING TABLE bookings_history_selective
FLOW AUTO CDC
FROM STREAM samples.wanderbricks.booking_updates
KEYS (booking_id)
SEQUENCE BY updated_at
STORED AS SCD TYPE 2
TRACK HISTORY ON * EXCEPT (city);
```
With this configuration, changes to city update the current row in place, while changes to any other column create a new historical version. This keeps your history table focused on the changes that actually matter to your business.
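One way to sanity-check that behavior, sketched against the table defined above: count versions per booking before and after a city-only change. The count should stay flat, while a change to any tracked column adds a row:

```sql
-- Version count per booking: a city-only change should not increase this
SELECT booking_id, COUNT(*) AS versions
FROM bookings_history_selective
GROUP BY booking_id
ORDER BY booking_id;
```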
Building an SCD Type 2 Model in Your Data Warehouse
Let's build a complete example from scratch using the Databricks SQL Query Editor connected to a SQL warehouse. Every step below runs directly in the Query Editor -- no notebooks, no pipeline UI, no Python. We will create a customer table with Change Data Feed (CDF) enabled, simulate real-world changes -- inserts, updates, and deletes -- and watch AUTO CDC handle all of it declaratively.
Step 1: Create a source table with Change Data Feed enabled
Change Data Feed captures row-level changes on a Delta table. This is our CDC source.
```sql
CREATE OR REPLACE TABLE mycatalog.default.customers (
  customer_id INT,
  first_name STRING,
  last_name STRING,
  email STRING,
  city STRING
)
TBLPROPERTIES (delta.enableChangeDataFeed = true);
```
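To confirm the property took effect before wiring up the pipeline, you can list the table properties:

```sql
-- Should report delta.enableChangeDataFeed = true
SHOW TBLPROPERTIES mycatalog.default.customers;
```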
Step 2: Insert initial records
Load 10 customers as our starting dataset:
```sql
INSERT INTO mycatalog.default.customers VALUES
  (1, 'Alice', 'Johnson', 'alice.johnson@example.com', 'Seattle'),
  (2, 'Bob', 'Smith', 'bob.smith@example.com', 'Portland'),
  (3, 'Carol', 'Williams', 'carol.williams@example.com', 'Denver'),
  (4, 'David', 'Brown', 'david.brown@example.com', 'Austin'),
  (5, 'Eve', 'Davis', 'eve.davis@example.com', 'Chicago'),
  (6, 'Frank', 'Miller', 'frank.miller@example.com', 'Boston'),
  (7, 'Grace', 'Wilson', 'grace.wilson@example.com', 'Miami'),
  (8, 'Hank', 'Moore', 'hank.moore@example.com', 'Atlanta'),
  (9, 'Ivy', 'Taylor', 'ivy.taylor@example.com', 'Dallas'),
  (10, 'Jack', 'Anderson', 'jack.anderson@example.com', 'Phoenix');
```
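As an optional check, you can peek at what the Change Data Feed has captured using the table_changes function. On a freshly created table, the CREATE is commit version 0 and this insert is version 1, so reading from version 1 should return all ten rows with _change_type = 'insert':

```sql
-- Row-level change events captured by CDF, from commit version 1 onward
SELECT * FROM table_changes('mycatalog.default.customers', 1);
```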
Step 3: Create the SCD Type 2 streaming table with AUTO CDC
This is the core of the entire pipeline -- and it is just one statement:
```sql
CREATE OR REFRESH STREAMING TABLE mycatalog.default.customers_scd2
SCHEDULE REFRESH EVERY 1 DAY
FLOW AUTO CDC
FROM STREAM mycatalog.default.customers
  WITH (readChangeFeed = true)
KEYS (customer_id)
APPLY AS DELETE WHEN _change_type = 'delete'
SEQUENCE BY _commit_timestamp
COLUMNS * EXCEPT (_change_type, _commit_version, _commit_timestamp)
STORED AS SCD TYPE 2;
```
Let's break down what each line does:
- `FROM STREAM ... WITH (readChangeFeed = true)` -- Reads the Change Data Feed from the source table, which provides row-level insert, update, and delete events.
- `KEYS (customer_id)` -- Identifies each customer uniquely.
- `APPLY AS DELETE WHEN _change_type = 'delete'` -- When CDF reports a delete, AUTO CDC closes the record's history (sets __END_AT) rather than leaving a ghost row.
- `SEQUENCE BY _commit_timestamp` -- Uses the CDF commit timestamp to order changes correctly -- even if they arrive out of sequence.
- `COLUMNS * EXCEPT (...)` -- Strips out the CDF metadata columns so the target table contains only business data.
- `STORED AS SCD TYPE 2` -- Maintains full history. Every change creates a new version with __START_AT and __END_AT tracking. If you only want to maintain the current version, simply swap this out for `STORED AS SCD TYPE 1`.
Step 4: Simulate real-world changes
Now let's simulate what happens in a production system -- new customers sign up, existing customers update their information, and some customers are removed.
Insert 2 new customers:
```sql
INSERT INTO mycatalog.default.customers VALUES
  (11, 'Karen', 'Thomas', 'karen.thomas@example.com', 'San Francisco'),
  (12, 'Leo', 'Jackson', 'leo.jackson@example.com', 'New York');
```
Update 2 existing customers -- Alice moved to Los Angeles, Bob changed his email:
```sql
UPDATE mycatalog.default.customers SET city = 'Los Angeles' WHERE customer_id = 1;
UPDATE mycatalog.default.customers SET email = 'bob.updated@example.com' WHERE customer_id = 2;
```
Delete 2 customers:
```sql
DELETE FROM mycatalog.default.customers WHERE customer_id = 9;
DELETE FROM mycatalog.default.customers WHERE customer_id = 10;
```
Step 5: Refresh and query the results
Refresh the streaming table to process all the changes:
```sql
REFRESH STREAMING TABLE mycatalog.default.customers_scd2;
```
Now query the SCD Type 2 table:
```sql
SELECT * FROM mycatalog.default.customers_scd2;
```

Here is what you will see in the results:
- Customers 3-8 -- Unchanged. One row each with __END_AT = NULL (current and active).
- Customer 1 (Alice) -- Two rows. The original Seattle version has a __END_AT timestamp marking when it was superseded. The new Los Angeles version has __END_AT = NULL -- she is current.
- Customer 2 (Bob) -- Two rows. Same pattern -- the old email version is closed, the new email version is current.
- Customers 9 and 10 (Ivy and Jack) -- Their records have __END_AT set to the deletion timestamp. They are closed -- not deleted from the table, but historically tracked as removed.
- Customers 11 and 12 (Karen and Leo) -- One row each with __END_AT = NULL. New additions, fully tracked from their first appearance.
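To inspect a single key's timeline, filter on the business key and order by __START_AT. Alice's two versions come back in order -- Seattle (closed) followed by Los Angeles (current):

```sql
SELECT customer_id, city, __START_AT, __END_AT
FROM mycatalog.default.customers_scd2
WHERE customer_id = 1
ORDER BY __START_AT;
```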
No dedup logic. No checksums. No window functions. No multi-arm MERGE. AUTO CDC handled all of it -- inserts, updates, deletes, and history tracking -- from a single declarative statement.
When to Use AUTO CDC
AUTO CDC is a natural fit any time you are building dimensional tables from change data:
- CDC feeds from operational databases -- Whether you are consuming Change Data Feed from Delta tables, Debezium streams, or any source that provides row-level changes, AUTO CDC handles the complexity of turning those changes into clean dimensions.
- Replacing existing MERGE-based SCD pipelines -- If you have existing pipelines with the 150-line MERGE pattern, AUTO CDC is a straightforward modernization path. Same result, a fraction of the code, and better error recovery.
- Out-of-order or late-arriving data -- If your source data does not arrive in perfect chronological order -- and it rarely does in production -- AUTO CDC's SEQUENCE BY handles this automatically instead of silently corrupting your dimensions.
Note: What about sources without a change feed? Many legacy databases and JDBC sources only provide periodic snapshots, not row-level change events. AUTO CDC FROM SNAPSHOT -- which automatically infers inserts, updates, and deletes by comparing successive snapshots -- is available today in Spark Declarative Pipelines via Python. SQL support is not yet available but is planned for a future release.
Requirements
- Compute: Serverless SQL Warehouse, or SDP Pro / Advanced edition
- Runtime: Databricks Runtime 17.3 or later
Get Started
AUTO CDC is available today in Beta. Here is how to get started:
Try it with your own data -- swap out the `mycatalog.default.customers` table for one of your CDC sources, and see how much procedural code you can retire. Take an existing MERGE-based SCD pipeline, rewrite it with AUTO CDC, and compare the results. We think you will be surprised by how much complexity disappears.
If you have feedback or questions, drop them in the comments below.