Meet Amira, a senior data engineer at a retail analytics company. Every day, she manages several Delta Lake tables that power dashboards, predictive models, and customer behaviour insights. Her team relies on Delta Lake for its reliability and scalability, but with frequent updates, deletes, and streaming jobs, Amira must constantly ensure the tables stay optimised, compliant, and recoverable.
Like many practitioners, she often faces questions such as: How far back can I query or restore an old version of a table? When is it safe to run VACUUM? How can downstream systems track row-level changes? And what happens if someone accidentally drops a table?
These challenges ultimately boil down to understanding how Delta Lake manages its files and transaction logs, which form the backbone of a Delta table.
Delta Lake is an open-source storage layer that provides ACID transactions, scalable metadata handling, and unified governance for data lakes. This article explores the anatomy of a Delta table, focusing on how its data files and transaction logs enable operations like time travel, vacuuming, restoring, and change tracking.
If you use managed tables instead of external tables, many of these operations are automated. We will cover the benefits at the end. Still, understanding how they work helps practitioners like Amira fine-tune performance, troubleshoot issues, and maintain historical access.
We’ll follow Amira’s journey as she tackles these key Delta Lake operations: inspecting table history and time travel, tracking row-level changes with Change Data Feed, cleaning up old files with VACUUM, and recovering data with RESTORE and UNDROP.
Before we begin to explore this use case with Amira, let us look at the Delta settings she will need to keep in mind:
| Delta setting | Description | Default value |
| --- | --- | --- |
| delta.logRetentionDuration = "interval <interval>" | Controls how long the history for a table is kept. | 30 days |
| delta.deletedFileRetentionDuration = "interval <interval>" | Determines the threshold VACUUM uses to remove data files no longer referenced in the current table version. | 7 days |
| delta.enableChangeDataFeed = true | Allows Databricks to track row-level changes between versions of a Delta table using Change Data Feed (CDF). | false |
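To check how these properties are set on a given table, she can list its table properties. Below is a minimal sketch using sales_transactions, the table we meet in a moment:

SHOW TBLPROPERTIES sales_transactions;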
Amira begins her day investigating why one of her tables — sales_transactions — has ballooned in size. Before diving into optimisation, she reviews what makes up a Delta table.
A Delta table’s folder structure includes:
- The actual data files (e.g., file1.parquet, file2.parquet)
- The _delta_log folder, which contains the JSON commit files and periodic checkpoints that make up the table’s transaction log
Anatomy of a Delta Lake table. Source: dennyglee
Understanding this structure helps Amira identify which files Delta Lake manages automatically and which may remain after updates or deletions.
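Before optimising anything, Amira can confirm where the table lives and how many data files it currently contains; DESCRIBE DETAIL surfaces fields such as location, numFiles, and sizeInBytes:

DESCRIBE DETAIL sales_transactions;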
Scenario: One morning, Amira gets a request from compliance to show what the sales_transactions table looked like at an earlier point in time.
Using Delta Lake’s history and time travel features, she can easily query historical versions.
Each table modification creates a new version, which is recorded in the transaction log. The command below helps Amira inspect all changes:
DESCRIBE HISTORY sales_transactions;
This shows when data was updated, by whom, and how, which is crucial for auditing.
By default, Delta keeps log history for 30 days (delta.logRetentionDuration = '30 days'), though Amira can extend this if audits require longer retention.
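If an audit requires a longer window, she can extend the log retention on the table; the 90-day interval below is purely illustrative:

ALTER TABLE sales_transactions
SET TBLPROPERTIES ('delta.logRetentionDuration' = 'interval 90 days');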
To travel back in time, she can query an older version directly:
SELECT * FROM sales_transactions VERSION AS OF 10;
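She can also travel by timestamp rather than version number; the date below is a placeholder:

-- Placeholder timestamp; use a point in time still covered by the table history
SELECT * FROM sales_transactions TIMESTAMP AS OF '2024-01-01';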
The transaction log is central to all of this. It records every change — from inserts to schema modifications — enabling both data lineage and recovery.
Scenario: Later that day, Amira’s downstream systems need to know exactly what has changed in sales_transactions every hour so they can update analytics dashboards in near real time.
Delta Lake’s Change Data Feed (CDF) makes this possible. It enables Amira to query the row-level changes, including inserts, updates, and deletes, that occurred between two versions of a table.
Fortunately, she had already enabled CDF on the table, so a record of those changes was already being kept:
ALTER TABLE sales_transactions SET TBLPROPERTIES
(delta.enableChangeDataFeed = true);
Now she is able to track row-level changes between versions of the Delta table:
SELECT * FROM table_changes('sales_transactions', 100, 110);
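Each change row carries metadata columns such as _change_type, _commit_version, and _commit_timestamp, so she can, for example, isolate only the deletes between two versions; a minimal sketch:

SELECT * FROM table_changes('sales_transactions', 100, 110)
WHERE _change_type = 'delete';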
CDF stores these changes in a _change_data folder under the table directory, working hand-in-hand with the transaction log and version history.
Anatomy of a Delta Lake table with CDF enabled.
However, Amira knows that CDF data is transient. Once old versions are removed by retention policies or VACUUM, the related CDF data disappears too.
To preserve important change data, she creates an incremental process that writes CDF output to a permanent audit table before cleanup jobs run.
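A minimal sketch of such a process is shown below; the audit table name and the version bounds are placeholders that a real job would track for each run:

-- First run: create the audit table from an initial version range (placeholder bounds)
CREATE TABLE IF NOT EXISTS analytics.sales_transactions_changes_audit
AS SELECT * FROM table_changes('sales_transactions', 100, 110);

-- Subsequent runs: append only the newly committed versions (placeholder bounds)
INSERT INTO analytics.sales_transactions_changes_audit
SELECT * FROM table_changes('sales_transactions', 111, 120);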
She also makes sure her retention settings accommodate the lag of streaming jobs:
ALTER TABLE sales_transactions
SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = '30 days');
This ensures the _change_data files remain available long enough for downstream consumers to process updates, even if streaming is delayed.
Scenario: After months of data ingestion, Amira notices that cloud storage costs have surged. She suspects that old data files are still lingering even after being logically deleted.
This is where the VACUUM command helps.
When Amira runs the command below, Delta removes data files that are no longer needed, such as outdated Parquet files left behind by previous updates or deletes:

VACUUM sales_transactions;
However, she must be careful: if the retention threshold is too short, she could accidentally remove files needed for time travel or streaming jobs.
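Before deleting anything, she can preview which files VACUUM would remove by combining an explicit retention window with a dry run; 720 hours corresponds to the 30-day retention she is about to set:

VACUUM sales_transactions RETAIN 720 HOURS DRY RUN;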
To align with her team’s data recovery needs, Amira sets a longer retention period (the same property she tuned earlier for streaming consumers):
ALTER TABLE sales_transactions
SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = '30 days');
This ensures that deleted files remain accessible for a month, providing enough time to recover from ingestion errors or delayed processing.
Scenario: A junior data engineer accidentally overwrote Amira’s table with bad data.
Table history showing a CREATE OR REPLACE operation, which overwrites the previous table contents
Instead of rebuilding from scratch, Amira simply runs:
RESTORE TABLE sales_transactions TO VERSION AS OF 6;
The RESTORE operation instantly rolls the table back to its pre-error state, version 6, thanks to the retained log and data files.
Table history showing the RESTORE operation
However, she must ensure that VACUUM has not deleted older data files; otherwise, restoring to those versions would be impossible.
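If she only knows roughly when the bad write landed rather than its exact version, she can also restore by timestamp; the value below is a placeholder:

-- Placeholder timestamp; pick a point just before the bad write
RESTORE TABLE sales_transactions TO TIMESTAMP AS OF '2024-01-01 09:00:00';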
Scenario: During her team’s routine cleanup, someone mistakenly drops a managed table in Unity Catalog:
DROP TABLE analytics.sales_transactions;
Fortunately, Unity Catalog provides a safety net for managed tables. Within 7 days, Amira can recover the table:
UNDROP TABLE analytics.sales_transactions;
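If she were unsure of the exact name, she could first list the recently dropped tables in the schema; a sketch, assuming the analytics schema used above:

SHOW TABLES DROPPED IN analytics;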
This feature saves her hours of restoration work and prevents data loss across the team.
Amira’s overarching goal is to minimise operational toil while maximising performance, governance, and recoverability. Unity Catalog managed tables are the default and recommended table type in Databricks because they are fully governed and automatically optimised, lowering storage and compute costs compared to external tables.
When Amira requires direct path-based access to register existing data, or when the lifecycle must be fully self-managed, external tables can be an appropriate solution. Just remember that dropping an external table doesn’t delete the underlying files, and Unity Catalog does not manage external table lifecycle for you.
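Registering existing Delta files as an external table is a single statement; the table name and storage path below are placeholders for wherever the data actually lives:

-- Placeholder table name and storage path
CREATE TABLE analytics.sales_transactions_external
USING DELTA
LOCATION 'abfss://data@mystorage.dfs.core.windows.net/delta/sales_transactions';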
For Amira and any data practitioner managing Delta tables on a daily basis, understanding how Delta Lake handles data and log retention is essential.
By mastering DESCRIBE HISTORY, time travel, Change Data Feed, VACUUM, and RESTORE, she ensures her tables remain optimised, compliant, and recoverable.
Through her daily experience, Amira demonstrates how Delta Lake’s operational concepts are not just theory; they are vital tools for maintaining reliable, scalable, and efficient data ecosystems.
If you would like to learn more about Delta Lake, managed Unity Catalog tables and more, check out the official Delta Lake and Databricks documentation.