SCD2 table migration using LakeFlow
a week ago
A source SQL DB of an operational system delivers daily snapshots to a legacy DWH with SCD2 logic enabled. Now we are migrating to Databricks. Let's look at the table "customer" (SCD2; with customer_id, valid_from and valid_to columns). On migration day t-1 I copy the table to Databricks. The valid_from column can be used as the sequence column.
After the migration, when the source SQL DB delivers its daily snapshots directly to Databricks, I want to use dp.create_auto_cdc_from_snapshot_flow(), which requires dp.create_streaming_table(). If I fill this streaming table with the state of the customer SCD2 table from day t-1 and then run dp.create_auto_cdc_from_snapshot_flow(), it complains that the streaming table is not empty.
How can I prepare this scenario so that dp.create_auto_cdc_from_snapshot_flow() works while keeping all the historical data in the target table?
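For reference, a minimal sketch of that one-time copy on day t-1 could look like the following. The JDBC URL, secret scope, source table name, and the target table name legacy_db.customer_migration_t_minus_1 are placeholders for illustration (the latter matches the table referenced in the reply below):

```python
# Minimal sketch of the one-time copy on migration day t-1.
# JDBC URL, credentials/secret scope, source table, and the target name
# legacy_db.customer_migration_t_minus_1 are assumptions for illustration.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:<source-db-url>")
    .option("dbtable", "dbo.customer")
    .option("user", dbutils.secrets.get("migration", "db_user"))
    .option("password", dbutils.secrets.get("migration", "db_password"))
    .load()
)

# Persist the t-1 state of the SCD2 table as a Delta table in Databricks
df.write.mode("overwrite").saveAsTable("legacy_db.customer_migration_t_minus_1")
```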
a week ago
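The approach below keeps the target streaming table empty and lets the flow itself load the migrated SCD2 state as the very first snapshot (when version is None); subsequent runs return the regular daily snapshots with an increasing version. Table names and the snapshot path are placeholders: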
from pyspark import pipelines as dp  # Lakeflow Declarative Pipelines Python API

def get_snapshot(version):
    # If version is None, this is the first run (migration day):
    # load the migrated legacy SCD2 data as the initial snapshot.
    if version is None:
        df = spark.read.table("legacy_db.customer_migration_t_minus_1")
        return (df, 1)
    # Subsequent runs (post-migration): read the latest daily snapshot as a
    # batch DataFrame and return it with a strictly increasing version.
    # Return None instead when no new snapshot is available.
    df_daily = spark.read.format("parquet").load("/path/to/daily/snapshots")
    return (df_daily, version + 1)

# The target streaming table must exist before the flow is defined
dp.create_streaming_table("customer_scd2")

dp.create_auto_cdc_from_snapshot_flow(
    target = "customer_scd2",
    source = get_snapshot,
    keys = ["customer_id"],
    stored_as_scd_type = 2  # ordering is driven by the snapshot version returned by get_snapshot
)
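With this pattern the table created by dp.create_streaming_table() is never pre-filled manually; the migrated SCD2 state reaches it through the first pipeline update instead, which avoids the "streaming table is not empty" error from the original setup.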
a week ago
Hello @peter_hoeltschi!
I think your workspace or user permissions only allow SQL/serverless compute, not classic clusters. Having enterprise pay-as-you-go at the account level does not automatically mean every user can create all-purpose compute in every workspace.
Ask a workspace admin to check that the "Allow unrestricted cluster creation" entitlement is enabled for your user, or that you at least have access to a compute policy such as Personal Compute.
The Databricks docs state that this entitlement controls whether a user can provision unrestricted compute; without it, users can only create compute through assigned policies. https://docs.databricks.com/aws/en/security/auth/entitlements
Senior BI/Data Engineer | Microsoft MVP Data Platform | Microsoft MVP Power BI | Power BI Super User | C# Corner MVP