
Capturing deletes for SCD2 using apply_changes or the apply_as_deletes option

KG_777
New Contributor

We're looking to implement SCD2 for tables in our lakehouse, and we need to keep track of records that are deleted in the source. Has anyone handled a similar use case? If so, could you outline some of the challenges you faced and the workarounds you implemented?

1 ACCEPTED SOLUTION

lingareddy_Alva
Honored Contributor II

Hi @KG_777 

Tracking deleted records in an SCD Type 2 implementation for a lakehouse architecture is indeed a challenging but common requirement.
Here's an overview of approaches, challenges, and workarounds based on industry experience:

Common Approaches:
1. Soft Delete Flags
The most straightforward approach is implementing a "current record" flag plus a "deleted" flag:
-- IsActive: Boolean indicating if this is the current version
-- IsDeleted: Boolean indicating if the record has been logically deleted
-- EffectiveStartDate: When the record version became active
-- EffectiveEndDate: When the record version became inactive
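
As a rough sketch, the Delta dimension table for this pattern might look like the following. This assumes a Databricks notebook where spark is in scope; the table and column names are illustrative, not a fixed convention:

# Illustrative SCD2 dimension schema with soft-delete tracking.
spark.sql("""
CREATE TABLE IF NOT EXISTS dim_customer_scd2 (
  customer_id          BIGINT,     -- business key from the source
  name                 STRING,
  is_active            BOOLEAN,    -- TRUE only for the current version of a key
  is_deleted           BOOLEAN,    -- TRUE once the key has vanished from the source
  effective_start_date TIMESTAMP,  -- when this version became active
  effective_end_date   TIMESTAMP   -- NULL while the version is still active
) USING DELTA
""")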

2. Change Data Capture (CDC) Integration
If your source systems support CDC:
-- Configure CDC to capture delete operations
-- Process these delete events in your lakehouse pipeline
-- Update existing records rather than actually removing data
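
Since your question mentions apply_changes and the apply-as-deletes route: in Delta Live Tables, apply_changes() can consume CDC delete events and maintain the SCD2 history for you. A minimal sketch, assuming a CDC feed with placeholder names (raw.customer_cdc, an operation column with INSERT/UPDATE/DELETE values, and an event_ts ordering column):

import dlt
from pyspark.sql.functions import col, expr

@dlt.view
def customer_cdc_feed():
    # Placeholder source: a CDC feed carrying an 'operation' column.
    return spark.readStream.table("raw.customer_cdc")

dlt.create_streaming_table("dim_customer_scd2")

dlt.apply_changes(
    target = "dim_customer_scd2",
    source = "customer_cdc_feed",
    keys = ["customer_id"],
    sequence_by = col("event_ts"),                     # ordering column from CDC
    apply_as_deletes = expr("operation = 'DELETE'"),   # treat these rows as deletes
    except_column_list = ["operation", "event_ts"],
    stored_as_scd_type = 2
)

With stored_as_scd_type = 2, DLT tracks history in __START_AT/__END_AT columns and end-dates the active row when a delete event arrives, so you don't have to hand-roll the expiry logic.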

3. Full Source Comparison
-- Take periodic full snapshots of source data
-- Compare with previous full snapshot to identify deletes
-- Mark records not found in new snapshot as deleted
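
A PySpark sketch of the snapshot diff (the snapshot and staging table names here are placeholders):

from pyspark.sql.functions import current_timestamp

prev_keys = spark.table("snapshots.customer_prev").select("customer_id")
curr_keys = spark.table("snapshots.customer_curr").select("customer_id")

# Keys present in the previous snapshot but missing from the current one
# are candidate deletes.
deleted_keys = prev_keys.join(curr_keys, "customer_id", "left_anti")

(deleted_keys
    .withColumn("detected_at", current_timestamp())
    .write.mode("append")
    .saveAsTable("staging.customer_deletes"))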

Challenges & Workarounds
Challenge 1: Identifying Deletes Without Source System Support
Challenge: Many source systems don't track or expose deletions reliably.

Workarounds:
-- Implement hash comparison of full datasets between runs
-- Use a window function to identify the latest version of each business key, then compare across loads
-- Store a "last seen date" and periodically mark records not seen for X days as potentially deleted.
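
For the "last seen date" idea, a sketch assuming the dimension also carries a last_seen_date column (not shown in the earlier schema sketch):

from delta.tables import DeltaTable

# Stamp last_seen_date for every business key present in today's extract.
dim = DeltaTable.forName(spark, "dim_customer_scd2")
(dim.alias("t")
    .merge(spark.table("snapshots.customer_curr")
                .select("customer_id").distinct().alias("s"),
           "t.customer_id = s.customer_id AND t.is_active = TRUE")
    .whenMatchedUpdate(set = {"last_seen_date": "current_date()"})
    .execute())

# Keys not seen for more than 7 days (tune X to your load cadence)
# become deletion candidates.
stale = spark.sql("""
    SELECT customer_id FROM dim_customer_scd2
    WHERE is_active = TRUE
      AND last_seen_date < date_sub(current_date(), 7)
""")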

Challenge 2: Performance at Scale
Challenge: Full comparison operations become expensive at scale.
Workarounds:
-- Partition your comparison logic by date ranges or other natural partitions
-- Use Spark's optimized join operations with appropriate broadcast hints
-- Implement incremental processing that focuses only on changed partitions
-- Use Delta Lake's MERGE command, together with table maintenance such as OPTIMIZE and Z-ORDER, to keep the comparison and update fast
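
For the MERGE workaround, here is a sketch that end-dates the active version for keys flagged as deleted, reusing the illustrative staging.customer_deletes table from the snapshot-diff sketch above:

spark.sql("""
MERGE INTO dim_customer_scd2 AS t
USING (SELECT customer_id, MAX(detected_at) AS detected_at
       FROM staging.customer_deletes
       GROUP BY customer_id) AS d
ON t.customer_id = d.customer_id AND t.is_active = TRUE
WHEN MATCHED THEN UPDATE SET
  t.is_active          = FALSE,
  t.is_deleted         = TRUE,
  t.effective_end_date = d.detected_at
""")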

Challenge 3: Late-Arriving Data
Challenge: Data might arrive out of sequence, causing false deletion flags.
Workarounds:
-- Implement a "quarantine" period before confirming deletions
-- Add versioning metadata to track confidence in deletion status
-- Use time-based windowing to handle out-of-order events
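
A sketch of the quarantine idea, reusing the illustrative tables from above: only confirm deletions that were first flagged more than X days ago and still haven't reappeared in the current snapshot.

# Oldest detection per key, so a key flagged repeatedly isn't reset.
pending = spark.sql("""
    SELECT customer_id, MIN(detected_at) AS first_detected_at
    FROM staging.customer_deletes
    GROUP BY customer_id
""")
curr_keys = spark.table("snapshots.customer_curr").select("customer_id")

# Keys still absent after a 3-day quarantine window are confirmed deletes;
# keys that reappeared drop out via the anti-join.
confirmed = (pending
    .join(curr_keys, "customer_id", "left_anti")
    .where("first_detected_at < current_timestamp() - INTERVAL 3 DAYS"))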


Challenge 4: Reconciliation and Recovery
Challenge: Deleted records might reappear later.
Workarounds:
-- Keep a separate "deletion log" table tracking when and why records were marked deleted
-- Implement "undelete" logic that can reactivate records with proper history
-- Design schema to support resurrection of previously deleted entities.
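
A sketch of such a deletion log table (names illustrative). When a deleted key later reappears, the regular SCD2 flow opens a new version in the dimension, and you stamp reactivated_at here to keep the audit trail intact:

spark.sql("""
CREATE TABLE IF NOT EXISTS audit.deletion_log (
  customer_id    BIGINT,
  deleted_at     TIMESTAMP,
  delete_reason  STRING,     -- e.g. 'missing_from_snapshot', 'cdc_delete_event'
  reactivated_at TIMESTAMP   -- populated if the record later resurfaces
) USING DELTA
""")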

 

LR


