Introduction

Meet Amira, a senior data engineer at a retail analytics company. Every day, she manages several Delta Lake tables that power dashboards, predictive models, and customer behaviour insights. Her team relies on Delta Lake for its reliability and scalability, but with frequent updates, deletes, and streaming jobs, Amira must constantly ensure the tables stay optimised, compliant, and recoverable.

Like many practitioners, she often faces questions such as:

  • “Why is my cloud storage bill increasing?”
  • “How far back can I time-travel this dataset?”
  • “How can I undo an accidental overwrite?”

These challenges ultimately boil down to understanding how Delta Lake manages its files and transaction logs, which form the backbone of a Delta table.

Delta Lake is an open-source storage layer that provides ACID transactions, scalable metadata handling, and unified governance for data lakes. This article explores the anatomy of a Delta table, focusing on how its data files and transaction logs enable operations like time travel, vacuuming, restoring, and change tracking.

If you use managed tables instead of external tables, many of these operations are automated. We will cover the benefits at the end. Still, understanding how they work helps practitioners like Amira fine-tune performance, troubleshoot issues, and maintain historical access.

We’ll follow Amira’s journey as she tackles these key Delta Lake operations:

  • History and Change Data Feed (CDF): Enabling time travel, auditing, and tracking of table changes on a row level.
  • VACUUM: Optimising storage and managing file retention.
  • DROP, RESTORE, and UNDROP: Recovering from data loss.

Before we begin to explore this use case with Amira, let us look at the Delta settings she will need to keep in mind:

Delta setting | Description | Default value
delta.logRetentionDuration = "interval <interval>" | Controls how long the history for a table is kept. | 30 days
delta.deletedFileRetentionDuration = "interval <interval>" | Determines the threshold VACUUM uses to remove data files no longer referenced in the current table version. | 7 days
delta.enableChangeDataFeed = true | Allows Databricks to track row-level changes between versions of a Delta table using Change Data Feed (CDF). | false
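
These properties are set per table with ALTER TABLE ... SET TBLPROPERTIES. Here is a minimal sketch using the default values above and the sales_transactions table from the scenario that follows (adjust the intervals to your own retention needs):

ALTER TABLE sales_transactions SET TBLPROPERTIES (
  'delta.logRetentionDuration' = 'interval 30 days',
  'delta.deletedFileRetentionDuration' = 'interval 7 days',
  'delta.enableChangeDataFeed' = 'true'
);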

 

Anatomy of a Delta Lake Table

Amira begins her day investigating why one of her tables — sales_transactions — has ballooned in size. Before diving into optimisation, she reviews what makes up a Delta table.

A Delta table’s folder structure includes:

  • A top-level directory representing the table itself
  • A _delta_log directory containing transaction log files

  • The actual data files (e.g., file1.parquet, file2.parquet)

[Figure: Anatomy of a Delta Lake table. Source: dennyglee]

Understanding this structure helps Amira identify which files Delta Lake manages automatically and which may remain after updates or deletions.
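
To see exactly where those files live and how large the table has become, she can start with the table's detail metadata; DESCRIBE DETAIL reports the table's location, number of files, and size in bytes:

DESCRIBE DETAIL sales_transactions;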

Time Travel and Auditing - History

Scenario: One morning, Amira gets a request from compliance:

  • “Can you show what the sales_transactions table looked like last Monday?”

Using Delta Lake’s history and time travel features, she can easily query historical versions.

Each table modification creates a new version, which is recorded in the transaction log. The command below helps Amira inspect all changes:

DESCRIBE HISTORY sales_transactions;

[Figure: DESCRIBE HISTORY output for sales_transactions]

This shows when data was updated, by whom, and how, which is crucial for auditing.

By default, Delta keeps log history for 30 days (delta.logRetentionDuration = '30 days'), though Amira can extend this if audits require longer retention.
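
For example, if audits require a longer window, she could extend the log retention (the 90-day interval below is illustrative):

ALTER TABLE sales_transactions
SET TBLPROPERTIES ('delta.logRetentionDuration' = 'interval 90 days');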

To travel back in time, she can query an older version directly:

SELECT * FROM sales_transactions VERSION AS OF 10;
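
Since the compliance request was phrased as a point in time rather than a version number, she can also time-travel by timestamp (the date below is illustrative):

SELECT * FROM sales_transactions TIMESTAMP AS OF '2025-11-10';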

The transaction log is central to all of this. It records every change — from inserts to schema modifications — enabling both data lineage and recovery.

Tracking Incremental Changes - Change Data Feed (CDF)

Scenario: Later that day, Amira’s downstream systems need to know exactly what changed in sales_transactions each hour so they can update analytics dashboards in real time.

Delta Lake’s Change Data Feed (CDF) makes this possible. It enables Amira to query the row-level changes, including inserts, updates, and deletes, that occurred between two versions of a table. 

Luckily, she had already enabled CDF on the table, so Delta has been keeping a record of those changes:

ALTER TABLE sales_transactions
SET TBLPROPERTIES (delta.enableChangeDataFeed = true);

Now she is able to track row-level changes between versions of the table:

SELECT * FROM table_changes('sales_transactions', 100, 110);
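
Each returned row also carries CDF metadata columns (_change_type, _commit_version, _commit_timestamp), so she can narrow the feed to just the changes a consumer cares about; for example, only updates and deletes over the same illustrative version range:

SELECT *
FROM table_changes('sales_transactions', 100, 110)
WHERE _change_type IN ('update_postimage', 'delete');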

CDF stores these changes in a _change_data folder under the table directory, working hand-in-hand with the transaction log and version history.

[Figure: Anatomy of a Delta Lake table with CDF enabled]

However, Amira knows that CDF data is transient. Once old versions are removed by retention policies or VACUUM, the related CDF data disappears too.

To preserve important change data, she creates an incremental process that writes CDF output to a permanent audit table before cleanup jobs run (learn more here).
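
A minimal sketch of that process (the audit table name and version range are illustrative; a production job would track and advance the last processed version):

-- Assumes sales_transactions_audit already exists with the CDF output schema
INSERT INTO sales_transactions_audit
SELECT * FROM table_changes('sales_transactions', 100, 110);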

She also makes sure her retention settings accommodate the lag of streaming jobs:

ALTER TABLE sales_transactions 
SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = '30 days');

This ensures the _change_data files remain available long enough for downstream consumers to process updates, even if streaming is delayed.

Behind the Scenes

We are developing a new enhancement for CDF that will enable its functionality without the need for _change_data files or additional configurations. Further updates will be provided soon.

Optimising Storage - VACUUM 

Scenario: After months of data ingestion, Amira notices that cloud storage costs have surged. She suspects that old data files are still lingering even after being logically deleted.

This is where the VACUUM command helps.

What VACUUM Does

  • Removes files not referenced by the current Delta table state
  • Physically deletes old, unreferenced files once they exceed the retention threshold (default: 7 days)

When Amira runs VACUUM sales_transactions, Delta removes files that are no longer needed, such as outdated Parquet files left behind by previous updates or deletes.

However, she must be careful: if the retention threshold is too short, she could accidentally remove files needed for time travel or streaming jobs.
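
A safe habit is to preview what a cleanup would remove before committing to it; VACUUM supports a DRY RUN that lists candidate files without deleting them:

VACUUM sales_transactions DRY RUN;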

Adjusting the Retention Period

To align with her team’s data recovery needs, Amira sets a longer retention period:

ALTER TABLE sales_transactions
SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = '30 days');

This ensures that deleted files remain accessible for a month, providing enough time to recover from ingestion errors or delayed processing.

Behind the Scenes

The default VACUUM retention period is 7 days. This duration strikes a balance between the extra cloud storage cost of retaining old files and giving users a reasonable window for data recovery and time travel.

Undoing Mistakes - RESTORE

Scenario: A junior data engineer accidentally overwrote Amira’s table with bad data. 

[Figure: Table history showing a CREATE OR REPLACE operation, which overwrites the previous table contents]

Instead of rebuilding from scratch, Amira simply runs:

RESTORE TABLE sales_transactions TO VERSION AS OF 6;

The RESTORE operation instantly rolls the table back to its pre-error state, version 6, thanks to the retained log and data files.
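
If she knows when the bad write happened rather than its version number, RESTORE also accepts a timestamp (the value below is illustrative):

RESTORE TABLE sales_transactions TO TIMESTAMP AS OF '2025-11-10T09:00:00';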

[Figure: Table history showing the RESTORE operation]

However, she must ensure that VACUUM has not deleted older data files; otherwise, restoring to those versions would be impossible.

Behind the Scenes

The RESTORE command provides a permanent way to revert a Delta Lake table to an earlier state, unlike time travel, which offers a temporary view of past versions. While time travel allows you to read a previous version of a Delta table, RESTORE effectively undoes changes in the current version without erasing any historical data. This metadata-only operation updates the transaction log to exclude certain files from being read, resetting the table's content to a prior version. Consequently, you can still use time travel to access any previous versions of the table, even after a RESTORE operation. The functionality of the RESTORE command relies on the log files, which maintain the history of the table.

Managing Accidental Deletion - DROP and UNDROP

Scenario: During her team’s routine cleanup, someone mistakenly drops a managed table in Unity Catalog:

DROP TABLE analytics.sales_transactions;

Fortunately, Unity Catalog provides a safety net. Within 7 days, Amira can recover the table:

UNDROP TABLE analytics.sales_transactions;

This feature saves her hours of restoration work and prevents data loss across the team.
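
If the dropped table's name or ID needs to be confirmed first (for example, because a new table with the same name has since been created), she can list recently dropped managed tables that are still recoverable:

SHOW TABLES DROPPED IN analytics;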

The Benefits of Unity Catalog Managed Tables

Amira’s overarching goal is to minimise operational toil while maximising performance, governance, and recoverability. Unity Catalog managed tables are the default and recommended table type in Databricks because they are fully governed and automatically optimised, lowering storage and compute costs compared to external tables. 

  • Automatic maintenance and optimisation with Predictive Optimisation: Managed tables can automatically run operations like OPTIMIZE, VACUUM, and ANALYZE, learning from read/write patterns to keep data layout and statistics in peak shape without manual intervention.
  • Lower costs and faster queries: Managed tables use AI-driven optimisation (including clustering and compaction) plus intelligent statistics collection to reduce compute costs and improve query latency across clients. 
  • Always-on metadata caching: Managed tables enable in-memory metadata caching to reduce transaction log reads and accelerate query planning; this is enabled by default for managed tables. 
  • Lifecycle automation and safety nets: After a DROP TABLE on a managed table, Databricks automatically deletes the underlying data from your cloud storage after 8 days; Unity Catalog also supports UNDROP TABLE to recover dropped managed tables within 7 days of the drop.
  • Consistent storage location and name-based access: Data files for managed tables are stored in the parent schema or catalog’s managed location, and Databricks recommends name-based access (catalog.schema.table) rather than path-based access to preserve governance and managed features.
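
As a small illustration of that last point, a managed table is created simply by omitting any LOCATION clause and then addressing it by name (the table and column names below are illustrative):

-- No LOCATION clause: Unity Catalog stores the data in the schema's managed location
CREATE TABLE analytics.sales_summary (
  txn_date DATE,
  total_amount DECIMAL(18, 2)
);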

When Amira requires direct path-based access to register existing data, or when the lifecycle must be fully self-managed, external tables can be an appropriate solution. Just remember that dropping an external table doesn’t delete the underlying files, and Unity Catalog does not manage external table lifecycle for you.

Conclusion

For Amira and any data practitioner managing Delta tables on a daily basis, understanding how Delta Lake handles data and log retention is essential.

By mastering features like DESCRIBE HISTORY, Change Data Feed (CDF), VACUUM, and RESTORE, she ensures her tables remain:

  • Clean (no obsolete files consuming storage)
  • Recoverable (from accidental overwrites or deletions)
  • Auditable (for compliance and debugging)
  • Stream-friendly (for continuous data pipelines)

Through her daily experience, Amira demonstrates how Delta Lake’s operational concepts are not just theory; they are vital tools for maintaining reliable, scalable, and efficient data ecosystems.

If you would like to learn more about Delta Lake, managed Unity Catalog tables and more, check out the resources below:
