<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Achieving batch-level overwrite for streaming SCD1 in DLT in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/achieving-batch-level-overwrite-for-streaming-scd1-in-dlt/m-p/131224#M49011</link>
    <description>&lt;P&gt;You can achieve this with dlt.apply_changes, but you need to configure it carefully to emulate key-based batch overwrite.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Step 1: Define Bronze as a streaming source&lt;/STRONG&gt;&lt;/P&gt;&lt;PRE&gt;import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Bronze snapshot data")
def bronze_customers():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "parquet")
        .load("dbfs:/mnt/source/snapshots/")
    )&lt;/PRE&gt;&lt;P&gt;&lt;STRONG&gt;Step 2: Use dlt.apply_changes in Silver with apply_as_deletes&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;Normally, dlt.apply_changes merges rows based on a primary key. To emulate a batch-level overwrite:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Set the keys to (CustomerNumber, SalesDate).&lt;/LI&gt;&lt;LI&gt;Provide a sequence_by column to define recency (e.g., IngestedAt).&lt;/LI&gt;&lt;LI&gt;Set apply_as_deletes = "true" so that all rows for a key from an older batch are removed before the new rows are inserted.&lt;/LI&gt;&lt;/UL&gt;&lt;PRE&gt;dlt.apply_changes(
    target = "silver_customers",
    source = "bronze_customers",
    keys = ["CustomerNumber", "SalesDate"],
    sequence_by = col("IngestedAt"),  # ingestion or snapshot timestamp
    stored_as_scd_type = "1",
    apply_as_deletes = "true"
)&lt;/PRE&gt;&lt;P&gt;Why I think this will work: if multiple rows exist in the same batch for (CustomerNumber, SalesDate), they are all preserved, because apply_changes deletes the old set and inserts the new set rather than deduplicating within a batch.&lt;/P&gt;&lt;P&gt;This mimics batch-level replacement semantics without custom MERGE logic. Let me know if it works, and if this helps, please mark it as an accepted solution.&lt;/P&gt;</description>
    <pubDate>Mon, 08 Sep 2025 11:16:08 GMT</pubDate>
    <dc:creator>ManojkMohan</dc:creator>
    <dc:date>2025-09-08T11:16:08Z</dc:date>
    <item>
      <title>Achieving batch-level overwrite for streaming SCD1 in DLT</title>
      <link>https://community.databricks.com/t5/data-engineering/achieving-batch-level-overwrite-for-streaming-scd1-in-dlt/m-p/131183#M48998</link>
      <description>&lt;P&gt;Hi all,&lt;/P&gt;&lt;P&gt;I am working with &lt;STRONG&gt;Databricks Delta Live Tables (DLT)&lt;/STRONG&gt; and have the following scenario:&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Setup:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Source data is delivered as &lt;STRONG&gt;weekly snapshots&lt;/STRONG&gt; (not CDC).&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;I have a &lt;STRONG&gt;bronze layer&lt;/STRONG&gt; (streaming table) and a &lt;STRONG&gt;silver layer&lt;/STRONG&gt; (also streaming).&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;I am implementing &lt;STRONG&gt;SCD1 logic&lt;/STRONG&gt; in the silver layer.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Key columns for SCD1 are: CustomerNumber and SalesDate.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;STRONG&gt;Requirement:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Sometimes, if there is an issue with the source data, the provider sends the &lt;STRONG&gt;whole weekly snapshot again&lt;/STRONG&gt;.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;In this case, I need to:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;P&gt;Check if rows for the given key combination (CustomerNumber, SalesDate) exist in the silver table.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Delete all existing rows&lt;/STRONG&gt; for that key.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Insert all rows&lt;/STRONG&gt; from the new batch.&lt;/P&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Essentially, I want &lt;STRONG&gt;batch-level overwrite per key&lt;/STRONG&gt; in a streaming silver table.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;STRONG&gt;Constraints:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;The solution must use &lt;STRONG&gt;DLT only&lt;/STRONG&gt;.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;I cannot use &lt;STRONG&gt;create_auto_cdc_flow&lt;/STRONG&gt;, because it works at &lt;STRONG&gt;row-level SCD&lt;/STRONG&gt;, not batch-level snapshots.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;STRONG&gt;Question:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Can I achieve this using &lt;STRONG&gt;dlt.apply_changes&lt;/STRONG&gt;?&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;If yes, how can I configure it so that &lt;STRONG&gt;all rows in a batch are preserved&lt;/STRONG&gt;, and the &lt;STRONG&gt;previous batch for the same keys is fully replaced&lt;/STRONG&gt; in a streaming table?&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;STRONG&gt;Optional Additional Info:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;My silver table may have multiple rows per key combination in the same batch.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Thanks in advance for any guidance!&lt;/P&gt;</description>
      <pubDate>Sun, 07 Sep 2025 20:40:20 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/achieving-batch-level-overwrite-for-streaming-scd1-in-dlt/m-p/131183#M48998</guid>
      <dc:creator>GuruRio</dc:creator>
      <dc:date>2025-09-07T20:40:20Z</dc:date>
    </item>
    <item>
      <title>Re: Achieving batch-level overwrite for streaming SCD1 in DLT</title>
      <link>https://community.databricks.com/t5/data-engineering/achieving-batch-level-overwrite-for-streaming-scd1-in-dlt/m-p/131204#M49009</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/183160"&gt;@GuruRio&lt;/a&gt;,&lt;BR /&gt;&lt;BR /&gt;Rather than rewriting it here, this article covers handling data deletion in DLT in full:&amp;nbsp;&lt;BR /&gt;&lt;A href="https://www.databricks.com/blog/handling-right-be-forgotten-gdpr-and-ccpa-using-delta-live-tables-dlt" target="_blank"&gt;https://www.databricks.com/blog/handling-right-be-forgotten-gdpr-and-ccpa-using-delta-live-tables-dlt&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;Hope that helps,&lt;BR /&gt;&lt;BR /&gt;Best, Ilir&lt;/P&gt;</description>
      <pubDate>Mon, 08 Sep 2025 08:07:36 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/achieving-batch-level-overwrite-for-streaming-scd1-in-dlt/m-p/131204#M49009</guid>
      <dc:creator>ilir_nuredini</dc:creator>
      <dc:date>2025-09-08T08:07:36Z</dc:date>
    </item>
    <item>
      <title>Re: Achieving batch-level overwrite for streaming SCD1 in DLT</title>
      <link>https://community.databricks.com/t5/data-engineering/achieving-batch-level-overwrite-for-streaming-scd1-in-dlt/m-p/131224#M49011</link>
      <description>&lt;P&gt;You can achieve this with dlt.apply_changes, but you need to configure it carefully to emulate key-based batch overwrite.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Step 1: Define Bronze as a streaming source&lt;/STRONG&gt;&lt;/P&gt;&lt;PRE&gt;import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Bronze snapshot data")
def bronze_customers():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "parquet")
        .load("dbfs:/mnt/source/snapshots/")
    )&lt;/PRE&gt;&lt;P&gt;&lt;STRONG&gt;Step 2: Use dlt.apply_changes in Silver with apply_as_deletes&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;Normally, dlt.apply_changes merges rows based on a primary key. To emulate a batch-level overwrite:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Set the keys to (CustomerNumber, SalesDate).&lt;/LI&gt;&lt;LI&gt;Provide a sequence_by column to define recency (e.g., IngestedAt).&lt;/LI&gt;&lt;LI&gt;Set apply_as_deletes = "true" so that all rows for a key from an older batch are removed before the new rows are inserted.&lt;/LI&gt;&lt;/UL&gt;&lt;PRE&gt;dlt.apply_changes(
    target = "silver_customers",
    source = "bronze_customers",
    keys = ["CustomerNumber", "SalesDate"],
    sequence_by = col("IngestedAt"),  # ingestion or snapshot timestamp
    stored_as_scd_type = "1",
    apply_as_deletes = "true"
)&lt;/PRE&gt;&lt;P&gt;Why I think this will work: if multiple rows exist in the same batch for (CustomerNumber, SalesDate), they are all preserved, because apply_changes deletes the old set and inserts the new set rather than deduplicating within a batch.&lt;/P&gt;&lt;P&gt;This mimics batch-level replacement semantics without custom MERGE logic. Let me know if it works, and if this helps, please mark it as an accepted solution.&lt;/P&gt;</description>
      <pubDate>Mon, 08 Sep 2025 11:16:08 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/achieving-batch-level-overwrite-for-streaming-scd1-in-dlt/m-p/131224#M49011</guid>
      <dc:creator>ManojkMohan</dc:creator>
      <dc:date>2025-09-08T11:16:08Z</dc:date>
    </item>
  </channel>
</rss>

