Delta Lake has become the foundation of modern data lakehouses. Organizations worldwide rely on it to bring ACID transactions, schema enforcement, and time travel capabilities to their data lakes. Yet there's a significant gap between using Delta Lake and truly understanding it. Most engineers interact with Delta Lake at the API level -- writing DataFrames, running MERGE operations, calling OPTIMIZE. But when performance degrades or queries slow down unexpectedly, troubleshooting becomes guesswork without knowledge of the underlying architecture.
This article bridges that gap. We'll explore how Delta Lake organizes data on storage, how it tracks changes through the transaction log, why updates can be expensive, and how maintenance operations like OPTIMIZE and VACUUM work at a fundamental level. Whether you're designing new tables, debugging slow queries, or optimizing existing pipelines, this knowledge will transform how you work with Delta Lake.
Every piece of data in Delta Lake lives in Apache Parquet files -- a columnar format where data is organized by column, not by row.
Why columnar matters: In a row-based format (like CSV or JSON), reading one column means reading every column. In Parquet, columns are stored separately. When you query SELECT col_a, col_b FROM table, Spark reads only those two columns. In a 100-column table, that's potentially 98% less I/O. Parquet's real power is in its structure.This makes Parquet a natural fit for analytical (OLAP) workloads, where queries typically aggregate or filter on a few columns across massive datasets.
Flexible compression is another advantage. Because data is stored column by column, Parquet can apply different compression and encoding schemes per column. Integer columns might use delta encoding and bit-packing; string columns might use dictionary encoding where repeated values are replaced with compact references.
A Parquet file contains:
The footer is the key to performance. For every column in every row group, Parquet stores: minimum value, maximum value, null count, and number of values. When Spark sees WHERE event_time >= '2024-01-15', it reads the footer first (tiny, fast, and easy to be cached), checks each row group's statistics, and skips any row group where max(event_time) < '2024-01-15'. This is row-group-level data skipping, an important distinction from file-level data skipping that Delta provides through its transaction log (covered below). Files whose row groups can't possibly contain matching data are never read at all.
The transaction log is an ordered record of every transaction ever performed on the table. It is the single mechanism through which Delta Lake guarantees ACID transactions and enables time travel. Without it, a Delta table would just be a directory of Parquet files with no coordination, no atomicity, and no way to reconstruct a consistent view of the data at any point in time.
In practice every change to the table -- INSERT, UPDATE, DELETE, MERGE, schema change -- creates a new JSON file in this directory:
_delta_log/
├── 00000000000000000000.json (commit 0 - table creation)
├── 00000000000000000001.json (commit 1)
├── 00000000000000000002.json (commit 2)
├── 00000000000000000010.checkpoint.parquet (checkpoint)
└── _last_checkpoint
Each commit file contains actions such as:
Why statistics exist at two levels: Delta stores column statistics in both the Parquet file footer and the transaction log's "add" actions. These serve different purposes. The transaction log statistics let Spark make skipping decisions without opening any Parquet files. When you query a Delta table, Spark reads the transaction log, extracts statistics from "add" actions, evaluates your WHERE clause against those statistics, builds a list of files that might contain matching data, and only then opens those specific Parquet files. Without statistics in the log, Spark would need to open every file just to read its footer. With statistics in the log, Spark knows which files to skip before any data I/O. By default, statistics are collected on the first 32 columns (for Unity Catalog managed tables, statistics collection is handled intelligently by predictive optimization without this 32-column limit).
When you query a Delta table, Spark doesn't just list the Parquet files in the directory. It reconstructs the table state through a process called log replay: starting from the latest checkpoint (or the beginning of the log), it sequentially applies each commit's actions — adding files from add actions, removing files from remove actions, applying metadata changes. The result is a consistent snapshot at a specific version: the exact set of files to read and the schema that applies.
This same mechanism powers time travel. To see the table at version N, the reader replays the log up to commit N and stops.
Replaying thousands of JSON files to reconstruct table state would be slow. Periodically, Delta creates a checkpoint — a snapshot of the complete table state in Parquet format. A checkpoint consolidates all actions up to that version, with canceled-out actions removed (e.g., a file that was added and then removed is omitted). Databricks optimizes checkpoint frequency automatically based on data size and workload.
Instead of replaying the entire log from commit 0, a reader locates the latest checkpoint, loads it, and only replays the JSON files written after it. The _last_checkpoint file points to the most recent one.
Delta Lake never modifies existing Parquet files. Ever. Every write operation -- INSERT, UPDATE, DELETE, MERGE -- creates new files. Old files are marked as "removed" in the transaction log but remain in storage.
This design enables:
But immutability comes with a cost.
When you update rows in Delta Lake, here's what actually happens:
This is Copy-on-Write (CoW). The critical insight: even updating 1 row in a 1GB file means reading 1GB and writing 1GB. The entire file is rewritten. This is write amplification -- a small logical change causes a large physical write.
And it's not just single-file impact. If an UPDATE matches rows across 5 files, all 5 files are fully rewritten -- even if only a handful of rows actually changed in each.
Example:
For tables with frequent small updates, this gets expensive fast.
The opposite of Copy-on-Write is Merge-on-Read (MoR): instead of rewriting files at write time, modifications are recorded as lightweight metadata and merged at read time. In Delta Lake, this strategy is implemented through Deletion Vectors.
When deletion vectors are enabled, DELETE, UPDATE, and MERGE operations mark affected rows as modified without rewriting the Parquet file. A compact bitmap (the deletion vector) records which row positions are invalidated. For updates, new versions of those rows go in a separate small file.
How it works:
Trade-off: Much faster writes (no file rewrites), slightly slower reads (must check deletion vectors), eventual compaction needed (OPTIMIZE merges everything).
Enable with:
ALTER TABLE my_table SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'true');
Databricks recommends DBR 14.3 LTS and above for full optimization support with deletion vectors.
MERGE is Delta Lake's most complex operation. Understanding its phases explains why it can be slow:
Phase 1 - Find Affected Files: Spark reads the transaction log and extracts file statistics. It evaluates your merge condition against those statistics to identify files that MIGHT contain matching rows. This is where good statistics and smart merge conditions pay off. If your condition includes a filter on a column with good statistics, Spark can skip most files.
Phase 2 - Apply Changes: Spark reads the identified files, joins with your source data, and applies the whenMatched/whenNotMatched logic. Then it writes new files containing the results.
Phase 3 - Commit: The transaction log gets "remove" actions for old files and "add" actions for new files with fresh statistics.
The optimization opportunity: Phase 1 determines how many files are scanned. The key is giving Spark enough information in your merge condition to skip files via statistics (min/max per column stored in the transaction log).
On a table with Liquid Clustering, data is already physically organized by the clustering columns, producing tight, non-overlapping min/max ranges per file. This means filters on clustering columns in your merge condition translate directly into aggressive file skipping -- Spark can eliminate most files from the scan without any manual tuning.
-- Table clustered by (event_time, event_id)
-- Weak: event_id alone may span many files
MERGE INTO target USING source
ON target.event_id = source.event_id
-- Better: adding a filter on a clustering column lets Spark
-- use tight file-level statistics to skip non-matching files
MERGE INTO target USING source
ON target.event_time >= '2024-01-15'
AND target.event_id = source.event_id
The same principle applies to any table with good file-level statistics, whether achieved through Liquid Clustering, Z-ordering, or naturally ordered data. The mechanism is the same: Spark checks the min/max statistics stored in the transaction log and skips files whose value ranges don't overlap with your filter. The advantage of Liquid Clustering is that it produces these tight ranges automatically and incrementally, without requiring you to manage partitioning schemes or schedule periodic Z-order rewrites.
Check your merge performance with DESCRIBE HISTORY my_table. Look at numTargetFilesRemoved vs numTargetFilesAdded in the operation metrics. The scanTimeMs and rewriteTimeMs metrics tell you how long was spent finding matching files vs. rewriting them. If scan time dominates, your merge condition isn't selective enough.
Streaming jobs and frequent appends create many small files. Why is this bad?
OPTIMIZE compacts small files. What happens: Delta identifies files smaller than the target size, reads and combines them into larger files, writes new compacted files, and the transaction log records "remove" small files, "add" compacted files. The operation has dataChange = false -- the logical data is identical; only physical layout changes.
What's the target file size? Databricks auto-tunes this based on table size:
|
Table size |
Auto-tuned target file size |
|---|---|
|
< 2.56 TB |
256 MB |
|
2.56 TB - 10 TB |
256 MB - 1 GB (scales linearly) |
|
> 10 TB |
1 GB |
You can override this with a fixed target:
ALTER TABLE my_table SET TBLPROPERTIES ('delta.targetFileSize' = '512mb');
The trade-off is read performance (fewer large files) vs. write amplification (smaller files = less rewrite cost on updates). Choose based on your table's read/write pattern.
Do you need to run OPTIMIZE manually? It depends:
Scoping OPTIMIZE with WHERE: You don't have to optimize the entire table. Use a WHERE clause to target specific clustering/partition keys:
OPTIMIZE my_table WHERE date >= current_timestamp() - INTERVAL 1 day
This is especially useful for large tables where only recent partitions receive new data.
Liquid Clustering is the recommended approach for all new tables. It replaces both traditional partitioning and Z-ordering with a simpler, more efficient mechanism. GA support is available on DBR 15.2 and above.
Liquid Clustering uses multi-dimensional data locality. The key innovation is that it's stateful -- Delta tracks clustering information in the transaction log, enabling:
-- Create table with liquid clustering
CREATE TABLE events (
event_id BIGINT,
event_time TIMESTAMP,
user_id BIGINT,
event_data STRING
)
USING DELTA
CLUSTER BY (event_time, event_id);
-- Trigger incremental clustering (only new/modified data)
OPTIMIZE events;
-- Change clustering columns without rewriting data
ALTER TABLE events CLUSTER BY (user_id, event_time);
Remember all those "removed" files from updates, deletes, and OPTIMIZE? They're still in storage. Why keep them? Time travel (SELECT * FROM my_table VERSION AS OF 5 or SELECT * FROM my_table TIMESTAMP AS OF '2024-01-15') and concurrent readers (a query started before the write might still be reading old files).
VACUUM physically deletes them:
VACUUM my_table RETAIN 168 HOURS -- 7 days (the default retention period)
What VACUUM does:
What VACUUM does NOT do:
After VACUUM: Time travel to versions older than the retention period will fail. Those files are gone.
Safety check: There's a built-in safety check that prevents running VACUUM with a dangerously low retention threshold. The reason is long-running jobs write data files before committing them to the transaction log. If VACUUM deletes those uncommitted files before the job completes, the transaction log will reference files that no longer exist — leading to table corruption and failed reads.
If you're certain there are no long-running operations, you can override it by setting spark.databricks.delta.retentionDurationCheck.enabled to false. But think twice before doing this.
The typical workflow:
-- Compact files and improve layout
OPTIMIZE my_table ZORDER BY (event_id);
-- After retention period, clean up old versions
VACUUM my_table RETAIN 168 HOURS;
For Unity Catalog managed tables, predictive optimization can run both OPTIMIZE and VACUUM automatically. If enabled, you generally don't need to schedule these yourself.
Delta Lake uses optimistic concurrency control — multiple writers proceed without acquiring locks, and conflicts are detected at commit time. Each writer goes through three stages:
Isolation levels: Databricks uses write-serializable isolation by default for table writes and updates, and snapshot isolation for reads. Write serializability ensures that the result of concurrent writes is equivalent to some serial ordering. Snapshot isolation means readers always see a consistent snapshot of the table as of the version when their query started, regardless of concurrent writes. You can configure the isolation level per table via the isolationLevel table property (WriteSerializable or Serializable).
Conflict detection is more nuanced than just checking whether files were modified. Delta also verifies whether new files have been added by concurrent transactions in partitions or ranges that your operation reads. For example, if your DELETE reads a partition and a concurrent INSERT adds new files to that same partition, Delta detects this as a conflict because your operation might have missed those new rows. Liquid clustering supports row level concurrency, and is way better than partitioning.
When a conflict is detected, the writer receives one of several exception types depending on the nature of the conflict:
Parquet files store column statistics (min/max/null_count) in the footer. Delta maintains these statistics at a second level in the transaction log so Spark can skip files without opening them.
Delta never modifies files. Updates use Copy-on-Write -- entire files are rewritten even for small changes. This is write amplification. Deletion vectors avoid this by marking rows as invalidated without rewriting files. For MERGE-heavy tables, set tuneFileSizesForRewrites to reduce rewrite cost.
MERGE performance relies heavily on the file-identification phase. Add filters to your merge condition to enable file pruning and data skipping. Check `scanTimeMs` and `rewriteTimeMs` in `DESCRIBE HISTORY` to diagnose bottlenecks.
OPTIMIZE compacts small files. The default target file size is auto-tuned (256MB for most tables, scaling to 1GB for tables over 10TB). Auto compaction helps but is not a full replacement for OPTIMIZE on large tables.
Liquid Clustering is the recommended approach for all new tables. It replaces both partitioning and Z-ordering, is incremental, and allows changing clustering columns without rewriting data. Z-ordering remains relevant for existing partitioned tables on older runtimes.
VACUUM physically deletes old file versions after the retention period (7 days by default). Run it to reclaim storage, but understand you lose time travel to deleted versions. For large tables, consider VACUUM LITE mode.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.