Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Best Practices for Implementing an Automated, Scalable, and Auditable Purge Mechanism on Azure Databricks

Phani1
Databricks MVP

 

Hi All, I'm looking to implement an automated, scalable, and auditable purge mechanism on Azure Databricks to manage data retention, deletion, and archival policies across our Unity Catalog-governed Delta tables.

I've come across various approaches, such as using VACUUM, REORG, and custom scripts, but I'd like to know if there are any best practices or recommended solutions for implementing such a mechanism.

Specifically, I'm interested in learning how to apply retention, deletion, and archival policies across all layers (Bronze, Silver, and Gold).

I'd appreciate any guidance, advice, or best practices you can share on implementing an automated, scalable, and auditable purge mechanism on Azure Databricks.

Regards,

Phani

3 REPLIES

Sumit_7
Honored Contributor II

@Phani1  Check my POV:

- Follow the Delta purge lifecycle: DELETE → REORG TABLE APPLY (PURGE) → VACUUM
- Metadata + Automation: use a control table + Databricks Workflows for scalable, policy-based execution.
- Retention by layer + audit centrally: Bronze (long), Silver (controlled), Gold (frequent) with logs for governance.
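As a minimal sketch of that lifecycle in Databricks SQL (the table name, date column, and retention values are hypothetical; adjust to your policy):

```sql
-- Hypothetical table and retention window.
-- 1. Logical delete of rows past retention
DELETE FROM my_catalog.bronze.events
WHERE ingest_date < current_date() - INTERVAL 365 DAYS;

-- 2. Physically rewrite the affected files so deleted rows leave disk
REORG TABLE my_catalog.bronze.events APPLY (PURGE);

-- 3. Remove the now-unreferenced files (keep a safety window for readers)
VACUUM my_catalog.bronze.events RETAIN 168 HOURS;
```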

lingareddy_Alva
Esteemed Contributor

Hi @Phani1 

This is a meaty topic, so let me give you a structured breakdown of the full purge/retention framework.

Core framework: layer-by-layer policies

Bronze (raw ingestion layer):
The goal here is preserving source fidelity while enforcing legal/regulatory minimums. Bronze tables typically carry the longest retention window since they serve as the system of record.
Retention: Keep raw data for the duration mandated by your regulatory baseline (often 7 years for financial, 1 to 2 years for operational telemetry). Use Delta table properties to encode this declaratively.
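For example, a hedged sketch of encoding retention declaratively (the delta.* keys are standard Delta table properties; the table name, tag names, and values are illustrative):

```sql
-- Standard Delta properties controlling how long deleted files and
-- transaction-log history are kept (values illustrative).
ALTER TABLE my_catalog.bronze.transactions SET TBLPROPERTIES (
  'delta.deletedFileRetentionDuration' = 'interval 30 days',
  'delta.logRetentionDuration' = 'interval 90 days'
);

-- Record the business policy as Unity Catalog tags (hypothetical tag names)
ALTER TABLE my_catalog.bronze.transactions
SET TAGS ('retention.days' = '2555', 'contains_pii' = 'true');
```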

Deletion (GDPR/CCPA hard deletes): Use REORG TABLE ... APPLY (PURGE) combined with DELETE for row-level deletion without waiting for VACUUM. This rewrites affected files immediately.

Archival before purge: Before hard-deleting Bronze records, archive them to cold Azure Blob Storage (Cool or Archive tier) using a managed EXTERNAL LOCATION in Unity Catalog. This maintains auditability without keeping hot Delta storage.
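A rough sketch of the archive-then-purge step, assuming an external location on a Cool-tier container is already registered in Unity Catalog (all names and the storage path are hypothetical):

```sql
-- One-time: an archive table backed by cold storage (hypothetical path).
CREATE TABLE IF NOT EXISTS my_catalog.archive.events_cold
LOCATION 'abfss://archive@coldstore.dfs.core.windows.net/events_cold'
AS SELECT * FROM my_catalog.bronze.events WHERE 1 = 0;

-- Copy out rows about to be purged, then run the DELETE/REORG/VACUUM cycle.
INSERT INTO my_catalog.archive.events_cold
SELECT * FROM my_catalog.bronze.events
WHERE ingest_date < current_date() - INTERVAL 365 DAYS;
```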


Silver (cleansed/conformed layer):
Silver is where CDC flows and SCD Type 2 tables live, so the purge logic is more nuanced: you have both current and historical rows per entity.
Retention: Apply a sliding window based on effective_date / end_date columns; a common pattern is a scheduled delete of closed-out history older than the window.
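A minimal sketch of such a sliding window, assuming hypothetical table and column names and a two-year history window:

```sql
-- Delete only superseded SCD2 versions (end_date set) older than the window;
-- current rows (end_date IS NULL) are never touched.
DELETE FROM my_catalog.silver.customers_scd2
WHERE end_date IS NOT NULL
  AND end_date < current_date() - INTERVAL 730 DAYS;

REORG TABLE my_catalog.silver.customers_scd2 APPLY (PURGE);
```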

Right-to-erasure (SCD2 problem): When you must purge a data subject, you need to delete all historical versions of their row, not just the current one. A common approach is tokenizing PII at Bronze ingestion, then storing the decryption key in a separate Key Vault-backed secrets table. Erasure = key deletion, not row deletion. This is far simpler to audit and avoids rewriting large Silver tables.
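One hedged way to structure the token lookup (schema and names are illustrative; the actual encryption and key deletion happen outside SQL, against Azure Key Vault):

```sql
-- Silver rows carry only `token`; PII is readable only via this lookup.
CREATE TABLE IF NOT EXISTS my_catalog.governance.pii_tokens (
  token         STRING NOT NULL,
  subject_id    STRING NOT NULL,
  encrypted_pii BINARY,          -- ciphertext; key lives in Azure Key Vault
  key_version   STRING
);

-- Right-to-erasure for one subject: drop the lookup row (and, outside SQL,
-- the Key Vault key). Silver SCD2 history stays intact but unreadable.
DELETE FROM my_catalog.governance.pii_tokens
WHERE subject_id = :subject_id;
```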
Schema evolution on purge: If the purge removes columns entirely (e.g., dropping a PII field), use schemaEvolutionMode=rescue in your streaming pipeline and version the schema change via a Unity Catalog metadata table, which you've already worked with in your DLT setup.

Gold (aggregated/serving layer):
Gold tables rarely need row-level purges (aggregates don't contain PII directly), but they do need:
   Materialization refresh: After upstream Silver purges, Gold tables must be recomputed or incrementally refreshed to remove derived PII. DLT pipelines handle this automatically if you trigger a full refresh after a purge event.
   Retention: Short, typically 90 to 365 days, since Gold is a derived layer. Archive older snapshots to Parquet on ADLS if historical trend reporting is needed.

Automation and orchestration
Build a Databricks Workflow (or DLT pipeline trigger) with these jobs:
1. Purge Request Ingestion: A Bronze Delta table yourcatalog.governance.purge_requests acts as the queue, populated by your DSAR (Data Subject Access Request) system or a scheduled retention sweep.
2. Purge Execution Job: A parameterized PySpark notebook that:
      Reads open requests from the queue
      Executes DELETE + REORG APPLY (PURGE) per table per subject
      Writes a completion record to an audit log table

3. VACUUM Job: Run separately on a schedule (not per-request) with RETAIN 0 HOURS only after confirming all streaming readers have checkpointed past the file versions being vacuumed.
4. Audit Log: Write every purge action to my_catalog.governance.purge_audit_log (table name, row count deleted, executor, timestamp, job run ID). This is your compliance evidence.
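A possible shape for the control and audit tables backing jobs 1-4 above (both schemas are hypothetical sketches, not a fixed design):

```sql
CREATE TABLE IF NOT EXISTS my_catalog.governance.purge_requests (
  request_id   STRING,
  table_name   STRING,
  subject_hash STRING,     -- hashed data-subject identifier
  status       STRING,     -- e.g. 'open' -> 'done' / 'failed'
  requested_at TIMESTAMP
);

CREATE TABLE IF NOT EXISTS my_catalog.governance.purge_audit_log (
  request_id   STRING,
  table_name   STRING,
  rows_deleted BIGINT,
  executed_by  STRING,     -- e.g. current_user()
  job_run_id   STRING,
  executed_at  TIMESTAMP
);
```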

Unity Catalog governance hooks
Since you're operating under Unity Catalog with my_catalog, you can enforce retention policies at the catalog level:
- Table tags as policy metadata (retention.days, contains_pii, data_classification)
- System tables (system.access.audit) for querying who deleted what and when; this is your built-in audit trail without extra infrastructure
- Row filters and column masks to pseudo-anonymize data in Gold while keeping the underlying rows intact for lineage purposes
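For instance, a hedged query against the audit system table (the table and the event_time/user_identity/action_name/request_params columns are documented; the exact action-name values vary by operation and release, so verify them in your workspace):

```sql
-- Recent delete-related actions; the LIKE filter is illustrative only.
SELECT event_time, user_identity.email, action_name, request_params
FROM system.access.audit
WHERE lower(action_name) LIKE '%delete%'
  AND event_time > current_timestamp() - INTERVAL 30 DAYS
ORDER BY event_time DESC;
```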

 

Key design decisions to nail down early
REORG vs VACUUM: REORG TABLE ... APPLY (PURGE) is your right-to-erasure workhorse; it immediately rewrites only the affected data files. VACUUM is a separate background operation that removes orphaned files on a schedule. Never conflate the two in your pipeline logic.
Key-based erasure for PII-heavy tables: For Silver SCD2 tables with many historical rows per data subject, deleting and rewriting rows is expensive at scale. A tokenization approach (storing PII in a separate lookup encrypted under Azure Key Vault, then deleting the key at erasure time) is far more operationally scalable and auditably clean. The Delta table rows remain; the data becomes unintelligible.
Streaming reader safety: Before running VACUUM on any table that feeds a streaming pipeline, confirm all active streams have checkpointed past the file versions you're about to delete. A RETAIN 168 HOURS (7 days) floor is a safe default for most streaming workloads. Given your use of transformWithStateInPandas and stateful streaming, this is especially important.
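A cautious sequence for that check (DRY RUN previews the files that would be removed without deleting anything; table name hypothetical):

```sql
-- Preview first; nothing is deleted.
VACUUM my_catalog.silver.events RETAIN 168 HOURS DRY RUN;

-- Only after confirming all streaming checkpoints are past these versions:
VACUUM my_catalog.silver.events RETAIN 168 HOURS;
```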
Audit trail completeness: Unity Catalog's system.access.audit captures DELETE operations automatically; lean on that as your immutable audit backbone. Your custom purge_audit_log table then adds business context (request ID, legal basis, data subject identifier hash) that the system table won't have.
DLT pipeline invalidation: After any Bronze or Silver purge, trigger a DLT full refresh on downstream Gold pipelines using the Pipelines API or a Databricks Workflow dependency. Don't rely on incremental refresh catching the deletion propagation.
Happy to go deeper on any specific piece: the tokenization pattern, the REORG workflow structure, audit log schema design, or how to wire the purge workflow into your existing DLT setup.

 

LR

AbhaySingh
Databricks Employee

Here is my action plan if it helps!

Phase 1: Foundation
  โ˜ Migrate to UC managed tables (if not already)
  โ˜ Enable Predictive Optimization at catalog level
  โ˜ Set delta.deletedFileRetentionDuration per layer

Phase 2: Retention Policies  
  โ˜ Enable Auto-TTL on Bronze tables (request Private Preview access)
  โ˜ Enable Auto-TTL on Silver tables with appropriate windows
  โ˜ Configure Azure lifecycle policies for archival tiers
  โ˜ Set delta.timeUntilArchived on tables with lifecycle policies

Phase 3: Deletion Workflows
  โ˜ Create GDPR/CCPA control table for deletion requests
  โ˜ Build scheduled Workflow job: DELETE โ†’ REORG PURGE โ†’ VACUUM
  โ˜ Use Materialized Views in Silver/Gold for automatic propagation
  โ˜ Test with VACUUM DRY RUN before production runs

Phase 4: Auditability
  โ˜ Set up dashboard on system.storage.predictive_optimization_operations_history
  โ˜ Create SQL Alerts for TTL/VACUUM failures
  โ˜ Document retention policies per catalog/schema
  โ˜ Build compliance report: tables with/without retention policies

Phase 5: Future (attribute-based retention, if and when available on the platform)
  ☐ Define governed tags for TTL time columns
  ☐ Apply catalog/schema-level retention policies via ABAC
  ☐ Monitor via Governance Hub