aleksandra_ch
Databricks Employee

Hi @batch_bender ,

For your case, I recommend using create_auto_cdc_from_snapshot_flow(). Since your system provides full snapshots without row-level operation data, this is the API designed to generate SCD tables accurately from that kind of source.

How it works: It compares the new snapshot to the target to identify changes:

  • New keys → INSERT

  • Existing keys with different values → UPDATE

  • Keys missing from the snapshot but present in target → DELETE
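
To make the comparison concrete, here is a minimal plain-Python sketch of the three rules above. It only illustrates the idea and is not how the pipeline is implemented internally; the function name diff_snapshot and the sample rows are made up for this example.

```python
def diff_snapshot(target, snapshot):
    """Compare a new full snapshot against the current target.

    Both arguments map a primary key to its row (here a dict of column values).
    Returns (inserts, updates, deletes).
    """
    # New keys -> INSERT
    inserts = {k: row for k, row in snapshot.items() if k not in target}
    # Existing keys with different values -> UPDATE
    updates = {k: row for k, row in snapshot.items()
               if k in target and target[k] != row}
    # Keys present in the target but missing from the snapshot -> DELETE
    deletes = sorted(k for k in target if k not in snapshot)
    return inserts, updates, deletes


# Hypothetical sample data: key 2 changed, key 3 disappeared, key 4 is new.
target = {
    1: {"name": "Ann", "city": "Oslo"},
    2: {"name": "Bob", "city": "Rome"},
    3: {"name": "Cid", "city": "Lima"},
}
snapshot = {
    1: {"name": "Ann", "city": "Oslo"},
    2: {"name": "Bob", "city": "Paris"},
    4: {"name": "Dee", "city": "Kyiv"},
}

inserts, updates, deletes = diff_snapshot(target, snapshot)
```

Note that key 1 appears in none of the three results because it is unchanged, which is why only real changes end up in the SCD history.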

Implementation Details:

How you pass the source depends on how the snapshots arrive:

  • Processing History: If you have multiple historical snapshots in your landing zone, pass a lambda function as the source. It tells the flow which snapshot to process next, so they are applied in order.

  • Periodic Snapshots: If the source simply overwrites the old snapshot with a new one each day, you can just pass the path or table name directly.
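
As a rough sketch of the "Processing History" case: the runnable part below is just the ordering logic, and the commented part shows how I would expect it to be wired into the pipeline. The helper pick_next_version, the list_versions() function, the /landing/customers path, and the table and key names are all assumptions for illustration, so please check the exact callback contract against the documentation before relying on it.

```python
def pick_next_version(available_versions, latest_processed):
    """Return the smallest version newer than latest_processed, or None if none is left.

    latest_processed is None on the very first run, in which case the
    oldest available snapshot is chosen.
    """
    candidates = sorted(
        v for v in available_versions
        if latest_processed is None or v > latest_processed
    )
    return candidates[0] if candidates else None


# Hypothetical snapshot versions found in the landing zone (e.g. yyyymmdd folders).
versions = [20240103, 20240101, 20240102]

first = pick_next_version(versions, None)
second = pick_next_version(versions, first)
done = pick_next_version(versions, 20240103)

# Sketch of the pipeline wiring (not executed here; names are assumptions):
#
# from pyspark import pipelines as dp
#
# def next_snapshot_and_version(latest_snapshot_version):
#     version = pick_next_version(list_versions(), latest_snapshot_version)
#     if version is None:
#         return None  # nothing left to process
#     return (spark.read.parquet(f"/landing/customers/{version}"), version)
#
# dp.create_streaming_table("customers_scd")
# dp.create_auto_cdc_from_snapshot_flow(
#     target="customers_scd",
#     source=next_snapshot_and_version,  # or a table name for periodic snapshots
#     keys=["customer_id"],
#     stored_as_scd_type=2,
# )
```

For the "Periodic Snapshots" case you would skip the function entirely and pass the table name or path as the source.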

Performance Note: Because create_auto_cdc_from_snapshot_flow() requires a full scan of every snapshot, it can be heavy on large datasets. If the source system eventually gains the ability to provide row-level change logs (CDC), switch to create_auto_cdc_flow(), which only has to process the changed rows.

Hope this helps!