<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Capturing deletes for SCD2 using apply changes or apply as delete decorator in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/capturing-deletes-for-scd2-using-apply-changes-or-apply-as/m-p/119509#M45897</link>
    <description>Reply from lingareddy_Alva: approaches, challenges, and workarounds for capturing source-system deletes in an SCD Type 2 lakehouse implementation. Full text in the thread item below.</description>
    <pubDate>Fri, 16 May 2025 19:52:04 GMT</pubDate>
    <dc:creator>lingareddy_Alva</dc:creator>
    <dc:date>2025-05-16T19:52:04Z</dc:date>
    <item>
      <title>Capturing deletes for SCD2 using apply changes or apply as delete decorator</title>
      <link>https://community.databricks.com/t5/data-engineering/capturing-deletes-for-scd2-using-apply-changes-or-apply-as/m-p/119475#M45887</link>
      <description>&lt;P&gt;We're looking to implement SCD2 for tables in our lakehouse, and we need to keep track of records that are deleted in the source. Does anyone have a similar use case? If so, could you outline some of the challenges you faced and the workarounds you implemented?&lt;/P&gt;</description>
      <pubDate>Fri, 16 May 2025 14:02:58 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/capturing-deletes-for-scd2-using-apply-changes-or-apply-as/m-p/119475#M45887</guid>
      <dc:creator>KG_777</dc:creator>
      <dc:date>2025-05-16T14:02:58Z</dc:date>
    </item>
    <item>
      <title>Re: Capturing deletes for SCD2 using apply changes or apply as delete decorator</title>
      <link>https://community.databricks.com/t5/data-engineering/capturing-deletes-for-scd2-using-apply-changes-or-apply-as/m-p/119509#M45897</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/161691"&gt;@KG_777&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Tracking deleted records in an SCD Type 2 implementation for a lakehouse architecture is a challenging but common requirement.&lt;BR /&gt;Here's an overview of approaches, challenges, and workarounds based on industry experience:&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Common Approaches&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;1. Soft Delete Flags&lt;/STRONG&gt;&lt;BR /&gt;The most straightforward approach is a "current record" flag plus a "deleted" flag:&lt;BR /&gt;-- IsActive: Boolean indicating whether this is the current version&lt;BR /&gt;-- IsDeleted: Boolean indicating whether the record has been logically deleted&lt;BR /&gt;-- EffectiveStartDate: when the record version became active&lt;BR /&gt;-- EffectiveEndDate: when the record version became inactive&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;2. Change Data Capture (CDC) Integration&lt;/STRONG&gt;&lt;BR /&gt;If your source systems support CDC:&lt;BR /&gt;-- Configure CDC to capture delete operations&lt;BR /&gt;-- Process these delete events in your lakehouse pipeline&lt;BR /&gt;-- Update existing records rather than physically removing data&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;3. Full Source Comparison&lt;/STRONG&gt;&lt;BR /&gt;-- Take periodic full snapshots of source data&lt;BR /&gt;-- Compare each snapshot with the previous one to identify deletes&lt;BR /&gt;-- Mark records missing from the new snapshot as deleted&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Challenges &amp;amp; Workarounds&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Challenge 1: Identifying Deletes Without Source System Support&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;Challenge&lt;/STRONG&gt;: Many source systems don't track or expose deletions reliably.&lt;BR /&gt;&lt;STRONG&gt;Workarounds&lt;/STRONG&gt;:&lt;BR /&gt;-- Implement hash comparison of full datasets between runs&lt;BR /&gt;-- Use a window function to identify the latest version of each business key, then compare across loads&lt;BR /&gt;-- Store a "last seen date" and periodically mark records not seen for X days as potentially deleted&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Challenge 2: Performance at Scale&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;Challenge&lt;/STRONG&gt;: Full comparison operations become expensive at scale.&lt;BR /&gt;&lt;STRONG&gt;Workarounds&lt;/STRONG&gt;:&lt;BR /&gt;-- Partition your comparison logic by date ranges or other natural partitions&lt;BR /&gt;-- Use Spark's optimized join operations with appropriate broadcast hints&lt;BR /&gt;-- Implement incremental processing that touches only changed partitions&lt;BR /&gt;-- Use Delta Lake's MERGE INTO command rather than full-table rewrites&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Challenge 3: Late-Arriving Data&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;Challenge&lt;/STRONG&gt;: Data might arrive out of sequence, causing false deletion flags.&lt;BR /&gt;&lt;STRONG&gt;Workarounds&lt;/STRONG&gt;:&lt;BR /&gt;-- Implement a "quarantine" period before confirming deletions&lt;BR /&gt;-- Add versioning metadata to track confidence in deletion status&lt;BR /&gt;-- Use time-based windowing to handle out-of-order events&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Challenge 4: Reconciliation and Recovery&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;Challenge&lt;/STRONG&gt;: Deleted records might reappear later.&lt;BR /&gt;&lt;STRONG&gt;Workarounds&lt;/STRONG&gt;:&lt;BR /&gt;-- Keep a separate "deletion log" table tracking when and why records were marked deleted&lt;BR /&gt;-- Implement "undelete" logic that can reactivate records with proper history&lt;BR /&gt;-- Design the schema to support resurrection of previously deleted entities&lt;/P&gt;</description>
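The soft-delete flags (approach 1) and full-snapshot comparison (approach 3) in the reply above can be sketched together in plain Python. This is a minimal illustration of the bookkeeping, not production pipeline code; the function name scd2_apply_snapshot, the row fields (key, value, start, end, is_active, is_deleted), and the tombstone-row convention are all illustrative choices, and in a real lakehouse each branch would become a clause of a Delta Lake MERGE.

```python
from copy import deepcopy

def scd2_apply_snapshot(history, snapshot, load_date):
    """Apply one full source snapshot to an SCD2 history.

    history  -- list of version rows: key, value, start, end,
                is_active (current version), is_deleted (soft-delete flag)
    snapshot -- dict mapping business key to its current source value
    Returns a new history list; the input is not mutated.
    """
    hist = deepcopy(history)
    active = {row["key"]: row for row in hist if row["is_active"]}

    # New, changed, or resurrected keys: close the old version, open a new one.
    for key, value in snapshot.items():
        cur = active.get(key)
        if cur is not None and cur["value"] == value and not cur["is_deleted"]:
            continue  # unchanged, nothing to do
        if cur is not None:
            cur["end"], cur["is_active"] = load_date, False
        hist.append({"key": key, "value": value, "start": load_date,
                     "end": None, "is_active": True, "is_deleted": False})

    # Keys absent from the snapshot: close the current version and add a
    # tombstone version so the delete itself is part of the history.
    for key, cur in active.items():
        if key not in snapshot and not cur["is_deleted"]:
            cur["end"], cur["is_active"] = load_date, False
            hist.append({"key": key, "value": cur["value"], "start": load_date,
                         "end": None, "is_active": True, "is_deleted": True})
    return hist
```

Because a reappearing key simply closes its tombstone and opens a fresh version, the "resurrection" case from Challenge 4 falls out of the same logic. In a Delta Live Tables pipeline, dlt.apply_changes with apply_as_deletes and stored_as_scd_type=2 (the decorator-style API the thread title refers to) performs this close-out natively when the CDC feed carries delete events, so hand-rolled comparison logic is only needed when the source cannot emit them.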
      <pubDate>Fri, 16 May 2025 19:52:04 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/capturing-deletes-for-scd2-using-apply-changes-or-apply-as/m-p/119509#M45897</guid>
      <dc:creator>lingareddy_Alva</dc:creator>
      <dc:date>2025-05-16T19:52:04Z</dc:date>
    </item>
  </channel>
</rss>

