SCD2 table migration using LakeFlow
a week ago
A source SQL DB of an operational system delivers daily snapshots to a legacy DWH with SCD2 logic enabled. Now we are migrating to Databricks. Let's look at the table "customer" (SCD2; with customer_id, valid_from and valid_to columns). On migration day t-1 I copy the table to Databricks. The valid_from column can be used as the sequence column.
After the migration, when the source SQL DB delivers its daily snapshots directly to Databricks, I want to use dp.create_auto_cdc_from_snapshot_flow(), which requires dp.create_streaming_table(). If I fill this streaming table with the state of the customer SCD2 table from day t-1 and then run dp.create_auto_cdc_from_snapshot_flow(), it complains that the streaming table is not empty.
How can I prepare this scenario so that dp.create_auto_cdc_from_snapshot_flow() works while keeping all the historical data in the target table?
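For reference, a minimal sketch of that one-time copy on day t-1 could look like the following. The JDBC URL, secret scope, source table name, and the target table name legacy_db.customer_migration_t_minus_1 are placeholders for illustration (the latter matches the table referenced in the reply below):

```python
# Minimal sketch of the one-time copy on migration day t-1.
# JDBC URL, credentials/secret scope, source table, and the target name
# legacy_db.customer_migration_t_minus_1 are assumptions for illustration.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:<source-db-url>")
    .option("dbtable", "dbo.customer")
    .option("user", dbutils.secrets.get("migration", "db_user"))
    .option("password", dbutils.secrets.get("migration", "db_password"))
    .load()
)

# Persist the t-1 state of the SCD2 table as a Delta table in Databricks
df.write.mode("overwrite").saveAsTable("legacy_db.customer_migration_t_minus_1")
```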
a week ago
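The approach below keeps the target streaming table empty and lets the flow itself load the migrated SCD2 state as the very first snapshot (when version is None); subsequent runs return the regular daily snapshots with an increasing version. Table names and the snapshot path are placeholders: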
from pyspark import pipelines as dp  # Lakeflow Declarative Pipelines Python API

def get_snapshot(version):
    # If version is None, this is the first run (migration day):
    # load the migrated legacy SCD2 data as the initial snapshot.
    if version is None:
        df = spark.read.table("legacy_db.customer_migration_t_minus_1")
        return (df, 1)
    # Subsequent runs (post-migration): read the latest daily snapshot as a
    # batch DataFrame and return it with a strictly increasing version.
    # Return None instead when no new snapshot is available.
    df_daily = spark.read.format("parquet").load("/path/to/daily/snapshots")
    return (df_daily, version + 1)

# The target streaming table must exist before the flow is defined
dp.create_streaming_table("customer_scd2")

dp.create_auto_cdc_from_snapshot_flow(
    target = "customer_scd2",
    source = get_snapshot,
    keys = ["customer_id"],
    stored_as_scd_type = 2  # ordering is driven by the snapshot version returned by get_snapshot
)
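With this pattern the table created by dp.create_streaming_table() is never pre-filled manually; the migrated SCD2 state reaches it through the first pipeline update instead, which avoids the "streaming table is not empty" error from the original setup.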
a week ago
Hello @peter_hoeltschi!
I think your workspace or user permissions only allow SQL/serverless compute, not classic clusters. Having enterprise pay-as-you-go at the account level does not automatically mean every user can create all-purpose compute in every workspace.
Ask a workspace admin to check that the "Allow unrestricted cluster creation" entitlement is enabled for your user, or that you at least have access to a compute policy such as Personal Compute.
The Databricks docs state that this entitlement controls whether a user can provision unrestricted compute; without it, users can only create compute through assigned policies. https://docs.databricks.com/aws/en/security/auth/entitlements
Senior BI/Data Engineer | Microsoft MVP Data Platform | Microsoft MVP Power BI | Power BI Super User | C# Corner MVP