Hi @KG_777
Tracking deleted records in an SCD Type 2 implementation for a lakehouse architecture is indeed a challenging but common requirement.
Here's an overview of approaches, challenges, and workarounds based on industry experience:
Common Approaches:
1. Soft Delete Flags
The most straightforward approach is implementing a "current record" flag plus a "deleted" flag:
-- IsActive: Boolean indicating if this is the current version
-- IsDeleted: Boolean indicating if the record has been logically deleted
-- EffectiveStartDate: When the record version became active
-- EffectiveEndDate: When the record version became inactive
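For illustration only, a dimension table carrying these columns might be defined roughly as below (the table and column names are placeholders, not anything from your environment):

```sql
-- Hypothetical SCD Type 2 dimension table with soft-delete tracking columns
CREATE TABLE IF NOT EXISTS dim_customer_scd2 (
  CustomerKey        BIGINT,     -- surrogate key for this version of the record
  CustomerId         STRING,     -- business key from the source system
  CustomerName       STRING,     -- example attribute being tracked
  IsActive           BOOLEAN,    -- TRUE only for the current version of a business key
  IsDeleted          BOOLEAN,    -- TRUE once the record is logically deleted in the source
  EffectiveStartDate TIMESTAMP,  -- when this version became active
  EffectiveEndDate   TIMESTAMP   -- NULL while the version is still current
) USING DELTA;
```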
2. Change Data Capture (CDC) Integration
If your source systems support CDC:
-- Configure CDC to capture delete operations
-- Process these delete events in your lakehouse pipeline
-- Update existing records rather than actually removing data
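A rough sketch of what that can look like with Delta Lake's MERGE (the table names, the operation column, and its 'D' value are assumptions about how your CDC feed is shaped, and the feed is assumed to be deduplicated to one event per key per batch):

```sql
-- Hypothetical: apply CDC delete events as soft deletes instead of physical removal
MERGE INTO dim_customer_scd2 AS t
USING cdc_customer_events AS s           -- assumed CDC feed with an operation column ('I'/'U'/'D')
  ON t.CustomerId = s.CustomerId
 AND t.IsActive = TRUE                   -- only touch the current version of each key
WHEN MATCHED AND s.operation = 'D' THEN
  UPDATE SET
    t.IsDeleted        = TRUE,
    t.IsActive         = FALSE,
    t.EffectiveEndDate = s.event_timestamp;
```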
3. Full Source Comparison
-- Take periodic full snapshots of source data
-- Compare with previous full snapshot to identify deletes
-- Mark records not found in new snapshot as deleted
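One way to express the snapshot comparison, assuming the snapshot has one row per business key and a recent Delta Lake / Databricks runtime that supports WHEN NOT MATCHED BY SOURCE (names are placeholders):

```sql
-- Hypothetical: flag active records that are missing from the latest full snapshot
MERGE INTO dim_customer_scd2 AS t
USING customer_snapshot_latest AS s
  ON t.CustomerId = s.CustomerId
WHEN NOT MATCHED BY SOURCE
 AND t.IsActive = TRUE AND t.IsDeleted = FALSE THEN
  UPDATE SET
    t.IsDeleted        = TRUE,
    t.IsActive         = FALSE,
    t.EffectiveEndDate = current_timestamp();
```

On older runtimes the same effect can be achieved with a LEFT ANTI JOIN to find the missing keys, followed by an UPDATE or MERGE driven by that result.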
Challenges & Workarounds
Challenge 1: Identifying Deletes Without Source System Support
Challenge: Many source systems don't track or expose deletions reliably.
Workarounds:
-- Implement hash comparison of full datasets between runs
-- Use a window function to identify the latest version of each business key, then compare across loads
-- Store a "last seen date" and periodically mark records not seen for X days as potentially deleted.
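To make the "last seen date" idea concrete, a minimal sketch along these lines (it assumes an extra LastSeenDate column on the dimension and a 30-day threshold, both arbitrary choices):

```sql
-- Hypothetical: stamp every key observed in the current load...
MERGE INTO dim_customer_scd2 AS t
USING current_load AS s
  ON t.CustomerId = s.CustomerId
 AND t.IsActive = TRUE
WHEN MATCHED THEN
  UPDATE SET t.LastSeenDate = current_date();

-- ...then treat anything not seen for 30 days as potentially deleted
UPDATE dim_customer_scd2
SET IsDeleted        = TRUE,
    IsActive         = FALSE,
    EffectiveEndDate = current_timestamp()
WHERE IsActive = TRUE
  AND LastSeenDate < date_sub(current_date(), 30);
```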
Challenge 2: Performance at Scale
Challenge: Full comparison operations become expensive at scale.
Workarounds:
-- Partition your comparison logic by date ranges or other natural partitions
-- Use Spark's optimized join operations with appropriate broadcast hints
-- Implement incremental processing that focuses only on changed partitions
-- Use Delta Lake's MERGE command and file-layout optimizations such as OPTIMIZE with ZORDER BY on the business key (sketched below).
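A rough sketch of both ideas (the partition column, region values, and table names are placeholders, and WHEN NOT MATCHED BY SOURCE needs a reasonably recent Delta Lake / Databricks runtime):

```sql
-- Hypothetical: limit the delete check to the partitions actually present in this load
MERGE INTO dim_customer_scd2 AS t
USING customer_snapshot_latest AS s
  ON t.CustomerId = s.CustomerId
WHEN NOT MATCHED BY SOURCE
 AND t.Region IN ('EMEA', 'APAC')        -- keep untouched partitions out of the comparison
 AND t.IsActive = TRUE AND t.IsDeleted = FALSE THEN
  UPDATE SET
    t.IsDeleted        = TRUE,
    t.IsActive         = FALSE,
    t.EffectiveEndDate = current_timestamp();

-- Periodic file-layout maintenance so merges on the business key stay fast
OPTIMIZE dim_customer_scd2 ZORDER BY (CustomerId);
```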
Challenge 3: Late-Arriving Data
Challenge: Data might arrive out of sequence, causing false deletion flags.
Workarounds:
-- Implement a "quarantine" period before confirming deletions (sketched after this list)
-- Add versioning metadata to track confidence in deletion status
-- Use time-based windowing to handle out-of-order events
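A minimal sketch of the quarantine idea, assuming two extra bookkeeping columns (DeleteStatus, PendingSince), a 7-day window, and a runtime that supports WHEN NOT MATCHED BY SOURCE, all of which are placeholder choices:

```sql
-- Hypothetical quarantine pass: stage suspected deletions instead of confirming them immediately
MERGE INTO dim_customer_scd2 AS t
USING customer_snapshot_latest AS s
  ON t.CustomerId = s.CustomerId
-- the key reappeared (e.g. late-arriving data): clear any pending flag from an earlier run
WHEN MATCHED AND t.IsActive = TRUE AND t.DeleteStatus = 'PENDING' THEN
  UPDATE SET t.DeleteStatus = NULL, t.PendingSince = NULL
-- the key is missing from this snapshot: quarantine it rather than deleting right away
WHEN NOT MATCHED BY SOURCE AND t.IsActive = TRUE AND t.DeleteStatus IS NULL THEN
  UPDATE SET t.DeleteStatus = 'PENDING', t.PendingSince = current_date();

-- Confirm only after the key has stayed missing for the full quarantine window
UPDATE dim_customer_scd2
SET DeleteStatus     = 'CONFIRMED',
    IsDeleted        = TRUE,
    IsActive         = FALSE,
    EffectiveEndDate = current_timestamp()
WHERE DeleteStatus = 'PENDING'
  AND PendingSince < date_sub(current_date(), 7);
```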
Challenge 4: Reconciliation and Recovery
Challenge: Deleted records might reappear later.
Workarounds:
-- Keep a separate "deletion log" table tracking when and why records were marked deleted (see the sketch after this list)
-- Implement "undelete" logic that can reactivate records with proper history
-- Design schema to support resurrection of previously deleted entities.
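One possible shape for that deletion log (all names and the batch id below are illustrative), populated in the same run that flags the soft deletes:

```sql
-- Hypothetical deletion log kept alongside the dimension for audits and undelete logic
CREATE TABLE IF NOT EXISTS dim_customer_deletion_log (
  CustomerId   STRING,      -- business key that was flagged
  DeletedAt    TIMESTAMP,   -- when the pipeline marked it deleted
  DeleteReason STRING,      -- e.g. 'missing from snapshot', 'CDC delete event'
  LoadBatchId  STRING       -- which pipeline run made the decision
) USING DELTA;

-- Record every key about to be soft-deleted in this run
INSERT INTO dim_customer_deletion_log
SELECT t.CustomerId, current_timestamp(), 'missing from snapshot', '2024-06-01-run-1'  -- placeholder batch id
FROM dim_customer_scd2 AS t
LEFT ANTI JOIN customer_snapshot_latest AS s
  ON t.CustomerId = s.CustomerId
WHERE t.IsActive = TRUE AND t.IsDeleted = FALSE;
```

If a logged key later reappears, the "undelete" can then insert a fresh active version row and leave the deleted version in place, so history stays intact and the log explains the gap.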
LR