Hi @KG_777
Tracking deleted records in an SCD Type 2 implementation for a lakehouse architecture is indeed a challenging but common requirement.
Here's an overview of approaches, challenges, and workarounds based on industry experience:
Common Approaches:
1. Soft Delete Flags
The most straightforward approach is implementing a "current record" flag plus a "deleted" flag:
-- IsActive: Boolean indicating if this is the current version
-- IsDeleted: Boolean indicating if the record has been logically deleted
-- EffectiveStartDate: When the record version became active
-- EffectiveEndDate: When the record version became inactive
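For illustration only, a dimension table carrying these columns might be defined roughly as below (the table and column names are placeholders, not anything from your environment):

```sql
-- Hypothetical SCD Type 2 dimension table with soft-delete tracking columns
CREATE TABLE IF NOT EXISTS dim_customer_scd2 (
  CustomerKey        BIGINT,     -- surrogate key for this version of the record
  CustomerId         STRING,     -- business key from the source system
  CustomerName       STRING,     -- example attribute being tracked
  IsActive           BOOLEAN,    -- TRUE only for the current version of a business key
  IsDeleted          BOOLEAN,    -- TRUE once the record is logically deleted in the source
  EffectiveStartDate TIMESTAMP,  -- when this version became active
  EffectiveEndDate   TIMESTAMP   -- NULL while the version is still current
) USING DELTA;
```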
2. Change Data Capture (CDC) Integration
If your source systems support CDC:
-- Configure CDC to capture delete operations
-- Process these delete events in your lakehouse pipeline
-- Update existing records rather than actually removing data
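A rough sketch of what that can look like with Delta Lake's MERGE (the table names, the operation column, and its 'D' value are assumptions about how your CDC feed is shaped, and the feed is assumed to be deduplicated to one event per key per batch):

```sql
-- Hypothetical: apply CDC delete events as soft deletes instead of physical removal
MERGE INTO dim_customer_scd2 AS t
USING cdc_customer_events AS s           -- assumed CDC feed with an operation column ('I'/'U'/'D')
  ON t.CustomerId = s.CustomerId
 AND t.IsActive = TRUE                   -- only touch the current version of each key
WHEN MATCHED AND s.operation = 'D' THEN
  UPDATE SET
    t.IsDeleted        = TRUE,
    t.IsActive         = FALSE,
    t.EffectiveEndDate = s.event_timestamp;
```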
3. Full Source Comparison
-- Take periodic full snapshots of source data
-- Compare with previous full snapshot to identify deletes
-- Mark records not found in new snapshot as deleted
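One way to express the snapshot comparison, assuming the snapshot has one row per business key and a recent Delta Lake / Databricks runtime that supports WHEN NOT MATCHED BY SOURCE (names are placeholders):

```sql
-- Hypothetical: flag active records that are missing from the latest full snapshot
MERGE INTO dim_customer_scd2 AS t
USING customer_snapshot_latest AS s
  ON t.CustomerId = s.CustomerId
WHEN NOT MATCHED BY SOURCE
 AND t.IsActive = TRUE AND t.IsDeleted = FALSE THEN
  UPDATE SET
    t.IsDeleted        = TRUE,
    t.IsActive         = FALSE,
    t.EffectiveEndDate = current_timestamp();
```

On older runtimes the same effect can be achieved with a LEFT ANTI JOIN to find the missing keys, followed by an UPDATE or MERGE driven by that result.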
Challenges & Workarounds
Challenge 1: Identifying Deletes Without Source System Support
Challenge: Many source systems don't track or expose deletions reliably.
Workarounds:
-- Implement hash comparison of full datasets between runs
-- Use a window function to identify the latest version of each business key, then compare across loads
-- Store a "last seen date" and periodically mark records not seen for X days as potentially deleted.
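To make the "last seen date" idea concrete, a minimal sketch along these lines (it assumes an extra LastSeenDate column on the dimension and a 30-day threshold, both arbitrary choices):

```sql
-- Hypothetical: stamp every key observed in the current load...
MERGE INTO dim_customer_scd2 AS t
USING current_load AS s
  ON t.CustomerId = s.CustomerId
 AND t.IsActive = TRUE
WHEN MATCHED THEN
  UPDATE SET t.LastSeenDate = current_date();

-- ...then treat anything not seen for 30 days as potentially deleted
UPDATE dim_customer_scd2
SET IsDeleted        = TRUE,
    IsActive         = FALSE,
    EffectiveEndDate = current_timestamp()
WHERE IsActive = TRUE
  AND LastSeenDate < date_sub(current_date(), 30);
```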
Challenge 2: Performance at Scale
Challenge: Full comparison operations become expensive at scale.
Workarounds:
-- Partition your comparison logic by date ranges or other natural partitions
-- Use Spark's optimized join operations with appropriate broadcast hints
-- Implement incremental processing that focuses only on changed partitions
-- Use Delta Lake's MERGE command and file-layout optimizations such as OPTIMIZE with ZORDER BY on the business key (sketched below).
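A rough sketch of both ideas (the partition column, region values, and table names are placeholders, and WHEN NOT MATCHED BY SOURCE needs a reasonably recent Delta Lake / Databricks runtime):

```sql
-- Hypothetical: limit the delete check to the partitions actually present in this load
MERGE INTO dim_customer_scd2 AS t
USING customer_snapshot_latest AS s
  ON t.CustomerId = s.CustomerId
WHEN NOT MATCHED BY SOURCE
 AND t.Region IN ('EMEA', 'APAC')        -- keep untouched partitions out of the comparison
 AND t.IsActive = TRUE AND t.IsDeleted = FALSE THEN
  UPDATE SET
    t.IsDeleted        = TRUE,
    t.IsActive         = FALSE,
    t.EffectiveEndDate = current_timestamp();

-- Periodic file-layout maintenance so merges on the business key stay fast
OPTIMIZE dim_customer_scd2 ZORDER BY (CustomerId);
```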
Challenge 3: Late-Arriving Data
Challenge: Data might arrive out of sequence, causing false deletion flags.
Workarounds:
-- Implement a "quarantine" period before confirming deletions (sketched after this list)
-- Add versioning metadata to track confidence in deletion status
-- Use time-based windowing to handle out-of-order events
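A minimal sketch of the quarantine idea, assuming two extra bookkeeping columns (DeleteStatus, PendingSince), a 7-day window, and a runtime that supports WHEN NOT MATCHED BY SOURCE, all of which are placeholder choices:

```sql
-- Hypothetical quarantine pass: stage suspected deletions instead of confirming them immediately
MERGE INTO dim_customer_scd2 AS t
USING customer_snapshot_latest AS s
  ON t.CustomerId = s.CustomerId
-- the key reappeared (e.g. late-arriving data): clear any pending flag from an earlier run
WHEN MATCHED AND t.IsActive = TRUE AND t.DeleteStatus = 'PENDING' THEN
  UPDATE SET t.DeleteStatus = NULL, t.PendingSince = NULL
-- the key is missing from this snapshot: quarantine it rather than deleting right away
WHEN NOT MATCHED BY SOURCE AND t.IsActive = TRUE AND t.DeleteStatus IS NULL THEN
  UPDATE SET t.DeleteStatus = 'PENDING', t.PendingSince = current_date();

-- Confirm only after the key has stayed missing for the full quarantine window
UPDATE dim_customer_scd2
SET DeleteStatus     = 'CONFIRMED',
    IsDeleted        = TRUE,
    IsActive         = FALSE,
    EffectiveEndDate = current_timestamp()
WHERE DeleteStatus = 'PENDING'
  AND PendingSince < date_sub(current_date(), 7);
```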
Challenge 4: Reconciliation and Recovery
Challenge: Deleted records might reappear later.
Workarounds:
-- Keep a separate "deletion log" table tracking when and why records were marked deleted (see the sketch after this list)
-- Implement "undelete" logic that can reactivate records with proper history
-- Design schema to support resurrection of previously deleted entities.
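One possible shape for that deletion log (all names and the batch id below are illustrative), populated in the same run that flags the soft deletes:

```sql
-- Hypothetical deletion log kept alongside the dimension for audits and undelete logic
CREATE TABLE IF NOT EXISTS dim_customer_deletion_log (
  CustomerId   STRING,      -- business key that was flagged
  DeletedAt    TIMESTAMP,   -- when the pipeline marked it deleted
  DeleteReason STRING,      -- e.g. 'missing from snapshot', 'CDC delete event'
  LoadBatchId  STRING       -- which pipeline run made the decision
) USING DELTA;

-- Record every key about to be soft-deleted in this run
INSERT INTO dim_customer_deletion_log
SELECT t.CustomerId, current_timestamp(), 'missing from snapshot', '2024-06-01-run-1'  -- placeholder batch id
FROM dim_customer_scd2 AS t
LEFT ANTI JOIN customer_snapshot_latest AS s
  ON t.CustomerId = s.CustomerId
WHERE t.IsActive = TRUE AND t.IsDeleted = FALSE;
```

If a logged key later reappears, the "undelete" can then insert a fresh active version row and leave the deleted version in place, so history stays intact and the log explains the gap.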
LR