Hi @batch_bender,
Given your scenario (daily full snapshots, no operation column, ordering by snapshot_date, unique key ID), create_auto_cdc_from_snapshot_flow() is the right tool for the job, and it is more than just a convenience wrapper. Here is a breakdown of the real technical differences.
WHAT SNAPSHOT CDC DOES UNDER THE HOOD
When you use create_auto_cdc_from_snapshot_flow(), the Lakeflow Spark Declarative Pipelines (SDP) engine automatically compares consecutive snapshots and infers inserts, updates, and deletes for you:
- A row present in the new snapshot but absent from the previous snapshot = INSERT
- A row present in both snapshots but with changed non-key columns = UPDATE
- A row present in the previous snapshot but absent from the new snapshot = DELETE
This diffing logic is built into the SDP runtime and is optimized for this exact pattern. You do not need to write any comparison logic yourself.
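Conceptually, the diff can be sketched in a few lines of plain Python. This is an illustration of the semantics only, not how the SDP engine is actually implemented:

```python
def diff_snapshots(previous, current):
    """Infer CDC events from two full snapshots.

    previous/current: dict mapping unique key -> row (dict of non-key columns).
    Returns a list of (operation, key, row) tuples.
    """
    events = []
    for key, row in current.items():
        if key not in previous:
            events.append(("INSERT", key, row))   # new in this snapshot
        elif row != previous[key]:
            events.append(("UPDATE", key, row))   # non-key columns changed
    for key, row in previous.items():
        if key not in current:
            events.append(("DELETE", key, row))   # vanished from this snapshot
    return events

yesterday = {1: {"name": "alice"}, 2: {"name": "bob"}}
today = {2: {"name": "robert"}, 3: {"name": "carol"}}
events = diff_snapshots(yesterday, today)
# key 2 changed -> UPDATE, key 3 appeared -> INSERT, key 1 vanished -> DELETE
```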
WHY NOT JUST BUILD CDC ROWS MANUALLY AND USE create_auto_cdc_flow()?
You could, but there are several practical and correctness reasons to prefer the snapshot API for your use case:
1. CORRECTNESS GUARANTEES
The snapshot API handles all the edge cases in diff detection atomically within the pipeline transaction. If you build CDC rows yourself (for example, by joining today's snapshot against yesterday's to detect changes), you are responsible for getting every edge case right: null handling in comparisons, ensuring no rows are missed or double-counted, and handling partial failures. The built-in snapshot diffing avoids these pitfalls.
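As one concrete example of such a pitfall: hand-rolled diffs often use a join predicate like old.col <> new.col, which under SQL's three-valued logic evaluates to NULL (treated as false) when either side is NULL, so changes to or from NULL are silently dropped. A small Python simulation of the two comparison semantics:

```python
def sql_not_equal(a, b):
    """Simulates SQL '<>' under three-valued logic: NULL if either side is NULL."""
    if a is None or b is None:
        return None  # unknown; a WHERE clause filters this row out
    return a != b

def null_safe_not_equal(a, b):
    """Simulates a null-safe comparison (SQL's IS DISTINCT FROM)."""
    return a != b

# An email changes from NULL to a real value between snapshots:
old_email, new_email = None, "a@example.com"
naive = sql_not_equal(old_email, new_email)       # None -> update missed
safe = null_safe_not_equal(old_email, new_email)  # True -> update detected
```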
2. SCD TYPE 2 TRACKING
If you ever need SCD type 2 history, the snapshot API populates __START_AT and __END_AT columns automatically using the snapshot version as the sequence marker. With create_auto_cdc_flow(), you would need to supply your own sequence_by column and explicitly tag each row with an operation type (INSERT, UPDATE, DELETE), which you said your source does not provide.
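To make the __START_AT / __END_AT behavior concrete, here is a simplified toy merge in plain Python (the column names match the docs; the merge logic is my own illustration, not the engine's implementation):

```python
def apply_scd2(history, snapshot, version):
    """Toy SCD type 2 merge using the snapshot version as the sequence marker.

    history: list of rows {key, value, __START_AT, __END_AT} (None = still open).
    snapshot: dict mapping key -> value for this snapshot version.
    """
    open_rows = {r["key"]: r for r in history if r["__END_AT"] is None}
    # Close rows whose value changed or whose key disappeared.
    for key, row in open_rows.items():
        if key not in snapshot or snapshot[key] != row["value"]:
            row["__END_AT"] = version
    # Open a new row for inserted or changed keys.
    for key, value in snapshot.items():
        if key not in open_rows or open_rows[key]["value"] != value:
            history.append({"key": key, "value": value,
                            "__START_AT": version, "__END_AT": None})
    return history

h = apply_scd2([], {1: "a"}, 1)  # version 1: key 1 inserted
h = apply_scd2(h, {1: "b"}, 2)   # version 2: key 1 updated
# history now holds a closed row (1, "a", 1, 2) and an open row (1, "b", 2, None)
```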
3. VERSION MANAGEMENT
The snapshot API tracks which snapshot version was last processed and picks up from there on the next pipeline run. For the historical snapshot variant (callable function returning DataFrame + version), this means you get exactly-once processing with automatic bookmarking. With create_auto_cdc_flow(), you would need to manage this state yourself or rely on Structured Streaming checkpoints in your source.
4. LESS CODE TO MAINTAIN
With the snapshot approach, the entire pipeline can look like this:
from pyspark import pipelines as dp

@dp.view(name="daily_snapshot")
def daily_snapshot():
    return spark.read.table("bronze.my_daily_snapshot_table")

dp.create_streaming_table("target_table")

dp.create_auto_cdc_from_snapshot_flow(
    target="target_table",
    source="daily_snapshot",
    keys=["key_id"],
    stored_as_scd_type=1
)
For this "periodic snapshot" pattern where you read from a table or view, the source is just a string name referencing the view. The pipeline ingests the current state of that view on each update and diffs it against the previous snapshot automatically. You do not need the callable function variant (the one returning DataFrame + snapshot_version) unless you are processing historical file-based snapshots from cloud storage.
WHEN THE CALLABLE FUNCTION VARIANT IS USEFUL
The source parameter accepting a callable that returns (DataFrame, snapshot_version) is designed for a specific scenario: you have a series of snapshot files in cloud storage (e.g., daily exports from Oracle or MySQL) and you want to replay them in order. In that case, you iterate over the files yourself:
def file_exists(path):
    # Helper, not part of the API: check whether the snapshot file exists.
    try:
        dbutils.fs.ls(path)
        return True
    except Exception:
        return False

def next_snapshot_and_version(latest_snapshot_version):
    latest_snapshot_version = latest_snapshot_version or 0
    next_version = latest_snapshot_version + 1
    path = f"/mnt/snapshots/daily_{next_version}.parquet"
    if file_exists(path):
        return (spark.read.parquet(path), next_version)
    return None  # no newer snapshot; processing stops until the next update

dp.create_auto_cdc_from_snapshot_flow(
    target="target_table",
    source=next_snapshot_and_version,
    keys=["key_id"],
    stored_as_scd_type=2
)
If your daily snapshot is already landing as a table (or you can read it as a view), you do not need this pattern at all.
WHEN create_auto_cdc_flow() IS THE BETTER CHOICE
Use create_auto_cdc_flow() when your source already provides row-level change events with operation metadata (INSERT, UPDATE, DELETE flags) and a sequence column, for example, from Debezium, a database CDC connector, or Delta Change Data Feed. In that case, the data already tells you what changed, and snapshot diffing would be unnecessary overhead.
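For contrast, here is roughly what a create_auto_cdc_flow() call looks like when the source does carry operation and sequence columns. The view name and column names are hypothetical; the parameter names follow the Databricks docs. This sketch only runs inside a Lakeflow pipeline:

```python
from pyspark import pipelines as dp
from pyspark.sql.functions import expr

dp.create_streaming_table("target_table")

dp.create_auto_cdc_flow(
    target="target_table",
    source="cdc_events",            # view with row-level change events
    keys=["key_id"],
    sequence_by="event_timestamp",  # ordering column your source must supply
    apply_as_deletes=expr("operation = 'DELETE'"),
    stored_as_scd_type=1
)
```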
SUMMARY
For your daily full-snapshot source with no operation column, create_auto_cdc_from_snapshot_flow() is the intended and recommended approach. It gives you built-in diffing, automatic version tracking, and SCD type 1 or 2 support with minimal code. The "periodic snapshot" variant (source as a view/table name string) keeps the implementation simple. Reserve the callable function variant for file-based historical replay scenarios.
Documentation reference:
https://docs.databricks.com/en/dlt/cdc.html
Note: the CDC APIs require the Pro or Advanced pipeline edition (or serverless pipelines).
* This reply was drafted with an agent system I built, which researches and drafts responses from the documentation I have available and prior memory. I personally review each draft for obvious issues and to monitor system reliability, and I update it when I detect drift, but there is still a small chance something is inaccurate, especially if you are experimenting with brand-new features.
If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.