Re: Best Practices for Implementing Automated, Sca...

lingareddy_Alva · ‎04-04-2026

This is a meaty topic — let me give you a structured breakdown of the full purge/retention framework.

Core framework: layer-by-layer policies

Bronze — raw ingestion layer:
The goal here is preserving source fidelity while enforcing legal/regulatory minimums. Bronze tables typically carry the longest retention window since they serve as the system of record.
Retention: Keep raw data for the duration mandated by your regulatory baseline (often 7 years for financial, 1–2 years for operational telemetry). Use Delta table properties to encode this declaratively.

Deletion (GDPR/CCPA hard deletes): Use REORG TABLE ... APPLY (PURGE) combined with DELETE for row-level deletion without waiting for VACUUM. This rewrites affected files immediately:

Archival before purge: Before hard-deleting Bronze records, archive them to cold Azure Blob Storage (Cool or Archive tier) using a managed EXTERNAL LOCATION in Unity Catalog. This maintains auditability without keeping hot Delta storage.

Silver — cleansed/conformed layer:
Silver is where CDC flows and SCD Type 2 tables live, so the purge logic is more nuanced — you have both current and historical rows per entity.
Retention: Apply a sliding window based on effective_date / end_date columns. A common pattern:

Right-to-erasure (SCD2 problem): When you must purge a data subject, you need to delete all historical versions of their row — not just the current one. A common approach is tokenizing PII at Bronze ingestion, then storing the decryption key in a separate Key Vault-backed secrets table. Erasure = key deletion, not row deletion. This is far simpler to audit and avoids rewriting large Silver tables.
Schema evolution on purge: If the purge removes columns entirely (e.g., dropping a PII field), use schemaEvolutionMode=rescue in your streaming pipeline and version the schema change via a Unity Catalog metadata table, which you've already worked with in your DLT setup.

Gold — aggregated/serving layer:
Gold tables rarely need row-level purges (aggregates don't contain PII directly), but they do need:
Materialization refresh: After upstream Silver purges, Gold tables must be recomputed or incrementally refreshed to remove derived PII. DLT pipelines handle this automatically if you trigger a full refresh after a purge event.
Retention: Short — typically 90–365 days since Gold is a derived layer. Archive older snapshots to Parquet on ADLS if historical trend reporting is needed.

Automation and orchestration
Build a Databricks Workflow (or DLT pipeline trigger) with these jobs:
1. Purge Request Ingestion: A Bronze Delta table
yourcatalog.governance.purge_requests acts as the queue — populated by your DSAR (Data Subject Access Request) system or a scheduled retention sweep.
2. Purge Execution Job: A parameterized PySpark notebook that
Reads open requests from the queue
Executes DELETE + REORG APPLY PURGE per table per subject
Writes a completion record to an audit log table

3. VACUUM Job: Run separately on a schedule (not per-request) with RETAIN 0 HOURS only after confirming all streaming readers have checkpointed past the file versions being vacuumed:
4. Audit Log: Write every purge action to my_catalog.governance.purge_audit_log — table name, row count deleted, executor, timestamp, job run ID. This is your compliance evidence.

Unity Catalog governance hooks
Since you're operating under Unity Catalog with my_catalog, you can enforce retention policies at the catalog level:
Table tags as policy metadata (retention.days, contains_pii, data_classification)
System tables (system.access.audit) for querying who deleted what and when — this is your built-in audit trail without extra infrastructure
Row filters and column masks to pseudo-anonymize data in Gold while keeping the underlying rows intact for lineage purposes

Key design decisions to nail down early
REORG vs VACUUM: REORG TABLE ... APPLY (PURGE) is your right-to-erasure workhorse — it immediately rewrites only the affected data files. VACUUM is a separate background operation that removes orphaned files on a schedule. Never conflate the two in your pipeline logic.
Key-based erasure for PII-heavy tables: For Silver SCD2 tables with many historical rows per data subject, deleting and rewriting rows is expensive at scale. A tokenization approach — storing PII in a separate lookup encrypted under Azure Key Vault, then deleting the key at erasure time — is far more operationally scalable and auditably clean. The Delta table rows remain; the data becomes unintelligible.
Streaming reader safety: Before running VACUUM on any table that feeds a streaming pipeline, confirm all active streams have checkpointed past the file versions you're about to delete. A RETAIN 168 HOURS (7 days) floor is a safe default for most streaming workloads. Given your use of transformWithStateInPandas and stateful streaming, this is especially important.
Audit trail completeness: Unity Catalog's system.access.audit captures DELETE operations automatically — lean on that as your immutable audit backbone. Your custom purge_audit_log table then adds business context (request ID, legal basis, data subject identifier hash) that the system table won't have.
DLT pipeline invalidation: After any Bronze or Silver purge, trigger a DLT full refresh on downstream Gold pipelines using the Pipelines API or a Databricks Workflow dependency. Don't rely on incremental refresh catching the deletion propagation.
Happy to go deeper on any specific piece — the tokenization pattern, the REORG workflow structure, audit log schema design, or how to wire the purge workflow into your existing DLT setup.

LR

View solution in original post