<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Scaling SCD on Databricks: Then vs Now in Community Articles</title>
    <link>https://community.databricks.com/t5/community-articles/scaling-scd-on-databricks-then-vs-now/m-p/148912#M1021</link>
    <description>&lt;P&gt;Between 2019 and 2021, we built a large-scale lakehouse on Databricks supporting multi-market payments processing (7B+ transactions/year).&lt;/P&gt;&lt;P&gt;If ingestion was complex (covered in Part 1), the Silver layer was even more interesting.&lt;/P&gt;&lt;P&gt;Implementing &lt;STRONG&gt;SCD Type 1&lt;/STRONG&gt; at scale on early versions of Delta Lake required significantly more engineering than many people remember.&lt;/P&gt;&lt;P&gt;Even though Delta Lake introduced ACID guarantees and MERGE support, production-grade SCD pipelines still required custom handling for:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Deduplication of CDC events&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Out-of-order updates&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Explicit column mapping in MERGE statements&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Schema evolution workarounds&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Multiple-match conflicts in micro-batches&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;To make this reliable, we built a fully parameterized Scala framework that:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Applied window-based deduplication&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Forced schema evolution via controlled writes&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Dynamically generated MERGE statements&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Standardized SCD logic across datasets&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;It worked, but it was heavy.&lt;/P&gt;&lt;P&gt;Fast forward to today: much of that custom framework logic can be replaced by Lakeflow Declarative Pipelines, specifically the AUTO CDC capability.&lt;/P&gt;&lt;P&gt;AUTO CDC abstracts:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Deduplication and sequencing&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Out-of-order handling&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;SCD Type 1 and Type 2 logic&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Delete semantics&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Streaming operational complexity&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;What once required hundreds of lines of Spark framework code can now be expressed declaratively.&lt;/P&gt;&lt;P&gt;That’s a major architectural shift.&lt;/P&gt;&lt;P&gt;I wrote a detailed breakdown of:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;The original SCD framework pattern&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;The specific Delta Lake limitations we had to work around&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;How AUTO CDC changes the Silver-layer design&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;What to validate before adopting it in production&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;🔗 Full article here: &lt;A href="https://medium.com/@wesley.felipe/databricks-lakehouse-without-the-workarounds-part-2-scd-840d9748920d" target="_blank" rel="noopener"&gt;https://medium.com/@wesley.felipe/databricks-lakehouse-without-the-workarounds-part-2-scd-840d9748920d&lt;/A&gt;&lt;/P&gt;</description>
    <pubDate>Fri, 20 Feb 2026 16:07:26 GMT</pubDate>
    <dc:creator>wesleyfelipe</dc:creator>
    <dc:date>2026-02-20T16:07:26Z</dc:date>
    <item>
      <title>Scaling SCD on Databricks: Then vs Now</title>
      <link>https://community.databricks.com/t5/community-articles/scaling-scd-on-databricks-then-vs-now/m-p/148912#M1021</link>
      <description>&lt;P&gt;Between 2019 and 2021, we built a large-scale lakehouse on Databricks supporting multi-market payments processing (7B+ transactions/year).&lt;/P&gt;&lt;P&gt;If ingestion was complex (covered in Part 1), the Silver layer was even more interesting.&lt;/P&gt;&lt;P&gt;Implementing &lt;STRONG&gt;SCD Type 1&lt;/STRONG&gt; at scale on early versions of Delta Lake required significantly more engineering than many people remember.&lt;/P&gt;&lt;P&gt;Even though Delta Lake introduced ACID guarantees and MERGE support, production-grade SCD pipelines still required custom handling for:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Deduplication of CDC events&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Out-of-order updates&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Explicit column mapping in MERGE statements&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Schema evolution workarounds&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Multiple-match conflicts in micro-batches&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;To make this reliable, we built a fully parameterized Scala framework that:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Applied window-based deduplication&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Forced schema evolution via controlled writes&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Dynamically generated MERGE statements&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Standardized SCD logic across datasets&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;It worked, but it was heavy.&lt;/P&gt;&lt;P&gt;Fast forward to today: much of that custom framework logic can be replaced by Lakeflow Declarative Pipelines, specifically the AUTO CDC capability.&lt;/P&gt;&lt;P&gt;AUTO CDC abstracts:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Deduplication and sequencing&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Out-of-order handling&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;SCD Type 1 and Type 2 logic&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Delete semantics&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Streaming operational complexity&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;What once required hundreds of lines of Spark framework code can now be expressed declaratively.&lt;/P&gt;&lt;P&gt;That’s a major architectural shift.&lt;/P&gt;&lt;P&gt;I wrote a detailed breakdown of:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;The original SCD framework pattern&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;The specific Delta Lake limitations we had to work around&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;How AUTO CDC changes the Silver-layer design&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;What to validate before adopting it in production&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;🔗 Full article here: &lt;A href="https://medium.com/@wesley.felipe/databricks-lakehouse-without-the-workarounds-part-2-scd-840d9748920d" target="_blank" rel="noopener"&gt;https://medium.com/@wesley.felipe/databricks-lakehouse-without-the-workarounds-part-2-scd-840d9748920d&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 20 Feb 2026 16:07:26 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/scaling-scd-on-databricks-then-vs-now/m-p/148912#M1021</guid>
      <dc:creator>wesleyfelipe</dc:creator>
      <dc:date>2026-02-20T16:07:26Z</dc:date>
    </item>
  </channel>
</rss>

