SCD2 table migration using LakeFlow
A source SQL DB of an operational system delivers daily snapshots to a legacy DWH with SCD2 logic enabled. Now we are migrating to Databricks. Let's look at the table "customer" (SCD2, with customer_id, valid_from and valid_to columns). On migration day t-1 I copy the table to Databricks; valid_from can be used as the sequence column.
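For context, the SCD2 semantics I need to preserve can be sketched in plain Python (a simplified model, not the actual pipeline; the helper name, the `attrs` field and the open-end sentinel are illustrative):

```python
from datetime import date

OPEN_END = date(9999, 12, 31)  # sentinel for "currently valid" rows

def apply_snapshot(history, snapshot, snapshot_date):
    """Apply one daily full snapshot to an SCD2 history.

    history:  list of dicts with customer_id, attrs, valid_from, valid_to
    snapshot: dict customer_id -> attrs (the complete daily snapshot)
    """
    current = {r["customer_id"]: r for r in history if r["valid_to"] == OPEN_END}
    # Close open rows whose attributes changed or whose key disappeared.
    for cid, row in current.items():
        if cid not in snapshot or snapshot[cid] != row["attrs"]:
            row["valid_to"] = snapshot_date
    # Open a new row for each new or changed key.
    for cid, attrs in snapshot.items():
        if cid not in current or current[cid]["attrs"] != attrs:
            history.append({"customer_id": cid, "attrs": attrs,
                            "valid_from": snapshot_date, "valid_to": OPEN_END})
    return history
```

For example, a customer whose name changes between two snapshots ends up with two rows: the old one closed at the second snapshot date and a new open one, which is why valid_from works as the sequence column.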
After the migration, the source SQL DB will deliver daily snapshots to Databricks directly, and I want to process them with dp.auto_cdc_from_snapshot_flow(), which requires dp.create_streaming_table(). However, if I pre-load this streaming table with the state of the customer SCD2 table from day t-1 and then run dp.auto_cdc_from_snapshot_flow(), it fails because the streaming table is not empty.
How can I prepare this scenario so that dp.auto_cdc_from_snapshot_flow() works while keeping all the historical data in the target table?
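For reference, this is roughly the pipeline definition I am attempting (a simplified sketch that only runs inside a Databricks pipeline; I am assuming the `pyspark.pipelines` import path and the name `create_auto_cdc_from_snapshot_flow`, and the snapshot helper is purely illustrative):

```python
# Runs only inside a Databricks LakeFlow Declarative Pipeline, not standalone.
from pyspark import pipelines as dp

# Target streaming table; this is the one I pre-loaded with the t-1 SCD2 state.
dp.create_streaming_table("customer")

def next_customer_snapshot(latest_version):
    # Illustrative helper: return (DataFrame, version) for the next daily
    # snapshot after `latest_version`, or None when no newer snapshot exists.
    ...

dp.create_auto_cdc_from_snapshot_flow(
    target="customer",
    source=next_customer_snapshot,
    keys=["customer_id"],
    stored_as_scd_type=2,  # keep valid_from / valid_to style history
)
```

The question is how the pre-migration history fits into this setup, since the flow apparently expects to start from an empty target table.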