Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

create_auto_cdc_from_snapshot_flow vs create_auto_cdc_flow – when is snapshot CDC actually worth it?

batch_bender
New Contributor

I am deciding between create_auto_cdc_from_snapshot_flow() and create_auto_cdc_flow() in a pipeline.

My source is a daily full snapshot table:

  • No operation column (no insert/update/delete flags)
  • Order can be derived from snapshot_date (sequence by)
  • Rows are unique based on key id

create_auto_cdc_from_snapshot_flow() fits this model, but it requires the source lambda to return (DataFrame, snapshot_version), which feels heavier to implement than simply producing CDC rows and using create_auto_cdc_flow().

So the question is:

For a system that only provides full daily snapshots (no row-level operations), what are the real technical advantages of using create_auto_cdc_from_snapshot_flow()?

Is snapshot-based AUTO CDC mainly a convenience API, or does it give better correctness, SCD2 handling, or performance guarantees than the create_auto_cdc_flow() approach?

3 REPLIES

pradeep_singh
Contributor

If your source only emits full daily snapshots, create_auto_cdc_from_snapshot_flow() is purpose-built for this pattern and will likely be simpler and safer to operate than synthesizing CDC events for create_auto_cdc_flow(). It automatically computes inserts, updates, and deletes between snapshots, supports SCD1/2, enforces strict snapshot ordering, and spares you from modeling delete/truncate semantics or event sequencing yourself.

Thank You
Pradeep Singh - https://www.linkedin.com/in/dbxdev

aleksandra_ch
Databricks Employee

Hi @batch_bender ,

For your case, I recommend using create_auto_cdc_from_snapshot_flow(). Since your system provides full snapshots without row-level operation data, this is the only way to accurately generate SCD tables.

How it works: It compares the new snapshot to the target to identify changes:

  • New keys → INSERT
  • Existing keys with different values → UPDATE
  • Keys missing from the snapshot but present in target → DELETE
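As a rough, purely illustrative sketch (not the actual engine code), that comparison can be thought of like this in plain Python, with each snapshot keyed by id:

```python
# Hypothetical illustration of snapshot diffing. Each snapshot is a dict
# mapping the key to the row's non-key values.

def diff_snapshots(prev, new):
    """Return (inserts, updates, deletes) between two keyed snapshots."""
    inserts = {k: v for k, v in new.items() if k not in prev}
    updates = {k: v for k, v in new.items() if k in prev and prev[k] != v}
    deletes = {k: v for k, v in prev.items() if k not in new}
    return inserts, updates, deletes

prev = {1: {"name": "a"}, 2: {"name": "b"}}
new = {2: {"name": "B"}, 3: {"name": "c"}}

ins, upd, dels = diff_snapshots(prev, new)
# key 3 is new (INSERT), key 2 changed (UPDATE), key 1 disappeared (DELETE)
```

The real engine does this at scale over DataFrames, but the INSERT/UPDATE/DELETE classification follows the same logic.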

Implementation Details:

The lambda function is necessary only if there are multiple historical snapshots in the landing zone to be processed. 

  • Processing History: If you have multiple historical snapshots in your landing zone, you'll need a lambda function to tell the flow how to order them.

  • Periodic Snapshots: If the source simply overwrites the old snapshot with a new one each day, you can just pass the path or table name directly.

Performance Note: Because create_auto_cdc_from_snapshot_flow() requires a full scan of every snapshot, it can be heavy on large datasets. If the source system eventually gains the ability to provide row-level change logs (CDC), it's better to switch to create_auto_cdc_flow() for improved performance.

Hope this helps!
SteveOstrowski
Databricks Employee

Hi @batch_bender,

Given your scenario (daily full snapshots, no operation column, ordering by snapshot_date, unique key ID), create_auto_cdc_from_snapshot_flow() is the right tool for the job, and it is more than just a convenience wrapper. Here is a breakdown of the real technical differences.

WHAT SNAPSHOT CDC DOES UNDER THE HOOD

When you use create_auto_cdc_from_snapshot_flow(), the Lakeflow Spark Declarative Pipelines (SDP) engine automatically compares consecutive snapshots and infers inserts, updates, and deletes for you:

- A row present in the new snapshot but absent from the previous snapshot = INSERT
- A row present in both snapshots but with changed non-key columns = UPDATE
- A row present in the previous snapshot but absent from the new snapshot = DELETE

This diffing logic is built into the SDP runtime and is optimized for this exact pattern. You do not need to write any comparison logic yourself.

WHY NOT JUST BUILD CDC ROWS MANUALLY AND USE create_auto_cdc_flow()?

You could, but there are several practical and correctness reasons to prefer the snapshot API for your use case:

1. CORRECTNESS GUARANTEES
The snapshot API handles all the edge cases in diff detection atomically within the pipeline transaction. If you build CDC rows yourself (for example, by joining today's snapshot against yesterday's to detect changes), you are responsible for getting every edge case right: null handling in comparisons, ensuring no rows are missed or double-counted, and handling partial failures. The built-in snapshot diffing avoids these pitfalls.

2. SCD TYPE 2 TRACKING
If you ever need SCD type 2 history, the snapshot API populates __START_AT and __END_AT columns automatically using the snapshot version as the sequence marker. With create_auto_cdc_flow(), you would need to supply your own sequence_by column and explicitly tag each row with an operation type (INSERT, UPDATE, DELETE), which you said your source does not provide.

3. VERSION MANAGEMENT
The snapshot API tracks which snapshot version was last processed and picks up from there on the next pipeline run. For the historical snapshot variant (callable function returning DataFrame + version), this means you get exactly-once processing with automatic bookmarking. With create_auto_cdc_flow(), you would need to manage this state yourself or rely on Structured Streaming checkpoints in your source.

4. LESS CODE TO MAINTAIN
With the snapshot approach, the entire pipeline can look like this:

from pyspark import pipelines as dp

@dp.view(name="daily_snapshot")
def daily_snapshot():
  return spark.read.table("bronze.my_daily_snapshot_table")

dp.create_streaming_table("target_table")

dp.create_auto_cdc_from_snapshot_flow(
  target="target_table",
  source="daily_snapshot",
  keys=["key_id"],
  stored_as_scd_type=1
)

For this "periodic snapshot" pattern where you read from a table or view, the source is just a string name referencing the view. The pipeline ingests the current state of that view on each update and diffs it against the previous snapshot automatically. You do not need the callable function variant (the one returning DataFrame + snapshot_version) unless you are processing historical file-based snapshots from cloud storage.

WHEN THE CALLABLE FUNCTION VARIANT IS USEFUL

The source parameter accepting a callable that returns (DataFrame, snapshot_version) is designed for a specific scenario: you have a series of snapshot files in cloud storage (e.g., daily exports from Oracle or MySQL) and you want to replay them in order. In that case, you iterate over the files yourself:

def next_snapshot_and_version(latest_snapshot_version):
  # file_exists() is a user-supplied helper (e.g., wrapping dbutils.fs or
  # a try/except around a metadata listing); it is not a built-in.
  latest_snapshot_version = latest_snapshot_version or 0
  next_version = latest_snapshot_version + 1
  path = f"/mnt/snapshots/daily_{next_version}.parquet"
  if file_exists(path):
      return (spark.read.parquet(path), next_version)
  # Returning None signals that there are no more snapshots to process
  return None

dp.create_auto_cdc_from_snapshot_flow(
  target="target_table",
  source=next_snapshot_and_version,
  keys=["key_id"],
  stored_as_scd_type=2
)

If your daily snapshot is already landing as a table (or you can read it as a view), you do not need this pattern at all.

WHEN create_auto_cdc_flow() IS THE BETTER CHOICE

Use create_auto_cdc_flow() when your source already provides row-level change events with operation metadata (INSERT, UPDATE, DELETE flags) and a sequence column, for example, from Debezium, a database CDC connector, or Delta Change Data Feed. In that case, the data already tells you what changed, and snapshot diffing would be unnecessary overhead.
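For illustration, the input create_auto_cdc_flow() expects looks roughly like this (the column names operation and sequence_num are made up for the example; you would map your actual columns via parameters such as sequence_by):

```python
# Hypothetical shape of row-level CDC events, as a plain Python list.
# Each event already states what happened and in what order, so no
# snapshot diffing is required downstream.
cdc_events = [
    {"key_id": 1, "name": "a", "operation": "INSERT", "sequence_num": 1},
    {"key_id": 1, "name": "A", "operation": "UPDATE", "sequence_num": 2},
    {"key_id": 1, "name": None, "operation": "DELETE", "sequence_num": 3},
]

# The sequence column resolves ordering; the final state of key_id=1 is
# simply whatever the highest-sequence event says.
latest = max(cdc_events, key=lambda e: e["sequence_num"])
```

If your source can ever emit data in this shape, the diffing work disappears and create_auto_cdc_flow() becomes the cheaper option.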

SUMMARY

For your daily full-snapshot source with no operation column, create_auto_cdc_from_snapshot_flow() is the intended and recommended approach. It gives you built-in diffing, automatic version tracking, and SCD type 1 or 2 support with minimal code. The "periodic snapshot" variant (source as a view/table name string) keeps the implementation simple. Reserve the callable function variant for file-based historical replay scenarios.

Documentation reference:
https://docs.databricks.com/en/dlt/cdc.html

Note: SDP pipeline editions Pro or Advanced (or serverless) are required for the CDC APIs.

* This reply was drafted with an agent system I built, which researches responses using the documentation I have available and previous memory. I personally review each draft for obvious issues and to monitor system reliability, and I update it when I detect drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand-new features.

If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.