<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: SCD2 table migration using LakeFlow in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/scd2-table-migration-using-lakeflow/m-p/156248#M54392</link>
    <description>Data Engineering topic: SCD2 table migration using LakeFlow</description>
    <pubDate>Wed, 06 May 2026 13:02:26 GMT</pubDate>
    <dc:creator>amirabedhiafi</dc:creator>
    <dc:date>2026-05-06T13:02:26Z</dc:date>
    <item>
      <title>SCD2 table migration using LakeFlow</title>
      <link>https://community.databricks.com/t5/data-engineering/scd2-table-migration-using-lakeflow/m-p/156234#M54390</link>
      <description>&lt;P&gt;A source SQL DB of an operational system delivers daily snapshots to a legacy DWH with SCD2 logic enabled. We are now migrating to Databricks. Let's look at the table "customer" (SCD2, with customer_id, valid_from and valid_to columns). On migration day t-1 I copy the table to Databricks. The valid_from column can be used as the sequence column.&lt;/P&gt;&lt;P&gt;After the migration, when the source SQL DB delivers daily snapshots to Databricks directly, I want to use dp.create_auto_cdc_from_snapshot_flow(), which requires dp.create_streaming_table(). If I fill this streaming table with the state of the customer SCD2 table from day t-1 and then run dp.create_auto_cdc_from_snapshot_flow(), it says the streaming table is not empty.&lt;/P&gt;&lt;P&gt;How do I prepare this scenario so that dp.create_auto_cdc_from_snapshot_flow() works with all the history data in the target table?&lt;/P&gt;
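&lt;P&gt;Roughly what I tried, as a minimal sketch (table and schema names are placeholders):&lt;/P&gt;&lt;LI-CODE lang="python"&gt;from pyspark import pipelines as dp

# Target streaming table for the SCD2 history (placeholder name)
dp.create_streaming_table("customer")

# The table was pre-filled beforehand with the migrated t-1 state
# (manual CTAS/INSERT), which is what triggers the
# "streaming table is not empty" error when the flow first runs.
def first_snapshot(latest_version):
    if latest_version is None:
        return (spark.read.table("staging.customer_daily_snapshot"), 1)
    return None  # no further snapshots in this sketch

dp.create_auto_cdc_from_snapshot_flow(
    target = "customer",
    source = first_snapshot,
    keys = ["customer_id"],
    stored_as_scd_type = 2
)&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>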
      <pubDate>Wed, 06 May 2026 09:29:48 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/scd2-table-migration-using-lakeflow/m-p/156234#M54390</guid>
      <dc:creator>peter_hoeltschi</dc:creator>
      <dc:date>2026-05-06T09:29:48Z</dc:date>
    </item>
    <item>
      <title>Re: SCD2 table migration using LakeFlow</title>
      <link>https://community.databricks.com/t5/data-engineering/scd2-table-migration-using-lakeflow/m-p/156237#M54391</link>
      <description>&lt;DIV class=""&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/116051"&gt;@peter_hoeltschi&lt;/a&gt;, see if this makes sense:&lt;/DIV&gt;&lt;DIV class=""&gt;The "Target table is not empty" error occurs because the DLT/LDP engine needs to initialize its own internal metadata and SCD2 tracking columns (__START_AT, __END_AT) from the first micro-batch.&lt;/DIV&gt;&lt;DIV class=""&gt;To solve this, instead of manually copying the data into the target table, you should feed the historical state through the flow as the initial snapshot.&lt;/DIV&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV class=""&gt;1. Prepare your migration data&lt;/DIV&gt;&lt;DIV class=""&gt;Ensure your migration data from day t-1 is available in a source location (like a Parquet/Delta table or a landing folder). Even if it’s an SCD2 table now, the snapshot flow works best if you can provide it as a state snapshot.&lt;/DIV&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;2. Create an empty target streaming table&lt;/DIV&gt;&lt;DIV class=""&gt;Do not use INSERT INTO. Use the DLT/LDP declaration to create a clean slate (shown at the top of the code below).&lt;/DIV&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;3. Use a versioned source function for the migration&lt;/DIV&gt;&lt;DIV class=""&gt;The key is the source argument. You can pass a function that allows you to "replay" history. If you have multiple daily snapshots from the legacy system, this function can iterate through them. If you only have the final t-1 state, the function will load that first.&lt;/DIV&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;LI-CODE lang="python"&gt;from pyspark import pipelines as dp
from pyspark.sql.utils import AnalysisException

# Declare the empty target streaming table (step 2); do not pre-fill it
dp.create_streaming_table("customer_scd2")

def get_snapshot(version):
    # First run (migration day): 'version' is None, so load the legacy
    # SCD2 state from day t-1 as the initial snapshot.
    if version is None:
        df = spark.read.table("legacy_db.customer_migration_t_minus_1")
        return (df, 1)

    # Subsequent runs (post-migration): 'version' is the last snapshot
    # version already processed, so advance to the next daily snapshot.
    # Note: cloudFiles is a streaming (Auto Loader) source, so the daily
    # snapshot files are batch-read directly instead.
    next_version = version + 1
    try:
        df_daily = spark.read.format("parquet").load(f"/path/to/daily/snapshots/{next_version}")
        return (df_daily, next_version)
    except AnalysisException:
        # No newer snapshot available yet; returning None ends processing
        return None

dp.create_auto_cdc_from_snapshot_flow(
    target = "customer_scd2",
    source = get_snapshot,
    keys = ["customer_id"],
    # Unlike create_auto_cdc_flow, the snapshot variant takes no sequence_by:
    # ordering comes from the snapshot versions returned by get_snapshot
    stored_as_scd_type = 2
)&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;DIV class=""&gt;If your legacy data is already in SCD2 format (multiple rows per ID with valid_to dates), make sure each snapshot you return contains only one current row per key, and that the versions you return are strictly increasing, so the flow can correctly order the history during the initial load.&lt;/DIV&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;/DIV&gt;
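&lt;DIV class=""&gt;One note on the sketch above: per the snapshot-flow contract, the engine passes the last version it processed back into your source function on the next update, and returning None signals that no new snapshot is available yet, so the update finishes cleanly.&lt;/DIV&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;/DIV&gt;</description>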
      <pubDate>Wed, 06 May 2026 09:50:07 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/scd2-table-migration-using-lakeflow/m-p/156237#M54391</guid>
      <dc:creator>rishav_sharma</dc:creator>
      <dc:date>2026-05-06T09:50:07Z</dc:date>
    </item>
    <item>
      <title>Re: SCD2 table migration using LakeFlow</title>
      <link>https://community.databricks.com/t5/data-engineering/scd2-table-migration-using-lakeflow/m-p/156248#M54392</link>
      <description>&lt;P&gt;Hello &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/116051"&gt;@peter_hoeltschi&lt;/a&gt;!&lt;/P&gt;&lt;P&gt;I think your workspace or user permissions only allow SQL/serverless compute, not classic clusters: having enterprise pay-as-you-go at the account level does not automatically mean every user can create all-purpose compute in every workspace.&lt;/P&gt;&lt;P&gt;Ask a workspace admin to check that the "Allow unrestricted cluster creation" entitlement is enabled for your user, or that you at least have access to a compute policy such as Personal Compute.&lt;/P&gt;&lt;P&gt;The Databricks docs say this entitlement controls whether a user can provision unrestricted compute; without it, users can only create compute through assigned policies. &lt;A title="https://docs.databricks.com/aws/en/security/auth/entitlements" href="https://docs.databricks.com/aws/en/security/auth/entitlements" target="_self"&gt;https://docs.databricks.com/aws/en/security/auth/entitlements&lt;/A&gt;&lt;/P&gt;
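&lt;P&gt;If it helps, here is one way an admin could check that entitlement programmatically. This is a minimal sketch against the workspace SCIM users API; the host, user name, and the DATABRICKS_TOKEN environment variable are placeholders:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;import os
import requests

HOST = "https://your-workspace-host"  # placeholder workspace URL
USER = "someone@example.com"          # placeholder user to check

resp = requests.get(
    f"{HOST}/api/2.0/preview/scim/v2/Users",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    params={"filter": f'userName eq "{USER}"'},
)
resp.raise_for_status()

for user in resp.json().get("Resources", []):
    entitlements = {e["value"] for e in user.get("entitlements", [])}
    # 'allow-cluster-create' is the unrestricted cluster creation entitlement
    print(user["userName"], "allow-cluster-create" in entitlements)&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>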
      <pubDate>Wed, 06 May 2026 13:02:26 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/scd2-table-migration-using-lakeflow/m-p/156248#M54392</guid>
      <dc:creator>amirabedhiafi</dc:creator>
      <dc:date>2026-05-06T13:02:26Z</dc:date>
    </item>
  </channel>
</rss>

