cancel
Showing results for 
Search instead for 
Did you mean: 
Technical Blog
Explore in-depth articles, tutorials, and insights on data analytics and machine learning in the Databricks Technical Blog. Stay updated on industry trends, best practices, and advanced techniques.
cancel
Showing results for 
Search instead for 
Did you mean: 
dhruvkumar1
Databricks Employee
Databricks Employee

Delta Lake has become the foundation of modern data lakehouses. Organizations worldwide rely on it to bring ACID transactions, schema enforcement, and time travel capabilities to their data lakes. Yet there's a significant gap between using Delta Lake and truly understanding it. Most engineers interact with Delta Lake at the API level -- writing DataFrames, running MERGE operations, calling OPTIMIZE. But when performance degrades or queries slow down unexpectedly, troubleshooting becomes guesswork without knowledge of the underlying architecture.

This article bridges that gap. We'll explore how Delta Lake organizes data on storage, how it tracks changes through the transaction log, why updates can be expensive, and how maintenance operations like OPTIMIZE and VACUUM work at a fundamental level. Whether you're designing new tables, debugging slow queries, or optimizing existing pipelines, this knowledge will transform how you work with Delta Lake.

THE FOUNDATION: PARQUET FILES

Every piece of data in Delta Lake lives in Apache Parquet files -- a columnar format where data is organized by column, not by row.

Why columnar matters: In a row-based format (like CSV or JSON), reading one column means reading every column. In Parquet, columns are stored separately. When you query SELECT col_a, col_b FROM table, Spark reads only those two columns. In a 100-column table, that's potentially 98% less I/O. Parquet's real power is in its structure.This makes Parquet a natural fit for analytical (OLAP) workloads, where queries typically aggregate or filter on a few columns across massive datasets.

Flexible compression is another advantage. Because data is stored column by column, Parquet can apply different compression and encoding schemes per column. Integer columns might use delta encoding and bit-packing; string columns might use dictionary encoding where repeated values are replaced with compact references.

A Parquet file contains:

  • Row Groups: Horizontal slices of data (typically 128MB each, configurable). Each row group contains all columns for a subset of rows.
  • Column Chunks: Within each row group, data for each column is stored contiguously. This is what enables reading only the columns you need.
  • Pages: The smallest unit of storage within a column chunk (default ~1MB). This is where compression and encoding happen.
  • Footer: Metadata at the end of the file containing schema, row group locations, and -- most importantly -- column statistics.

The footer is the key to performance. For every column in every row group, Parquet stores: minimum value, maximum value, null count, and number of values. When Spark sees WHERE event_time >= '2024-01-15', it reads the footer first (tiny, fast, and easy to be cached), checks each row group's statistics, and skips any row group where max(event_time) < '2024-01-15'. This is row-group-level data skipping, an important distinction from file-level data skipping that Delta provides through its transaction log (covered below). Files whose row groups can't possibly contain matching data are never read at all.

THE TRANSACTION LOG: DELTA'S SECRET WEAPON

The transaction log is an ordered record of every transaction ever performed on the table. It is the single mechanism through which Delta Lake guarantees ACID transactions and enables time travel. Without it, a Delta table would just be a directory of Parquet files with no coordination, no atomicity, and no way to reconstruct a consistent view of the data at any point in time.

In practice every change to the table -- INSERT, UPDATE, DELETE, MERGE, schema change -- creates a new JSON file in this directory:

_delta_log/
├── 00000000000000000000.json    (commit 0 - table creation)
├── 00000000000000000001.json    (commit 1)
├── 00000000000000000002.json    (commit 2)
├── 00000000000000000010.checkpoint.parquet  (checkpoint)
└── _last_checkpoint

Each commit file contains actions such as:

  • "add" actions: New files added to the table, including file path, size, partition values, and column statistics
  • "remove" actions: Files marked as logically deleted (but not physically deleted yet).
  • "metaData" actions: Schema changes, table properties, etc
  • "commitInfo": Operation type, timestamp, metrics.

Why statistics exist at two levels: Delta stores column statistics in both the Parquet file footer and the transaction log's "add" actions. These serve different purposes. The transaction log statistics let Spark make skipping decisions without opening any Parquet files. When you query a Delta table, Spark reads the transaction log, extracts statistics from "add" actions, evaluates your WHERE clause against those statistics, builds a list of files that might contain matching data, and only then opens those specific Parquet files. Without statistics in the log, Spark would need to open every file just to read its footer. With statistics in the log, Spark knows which files to skip before any data I/O. By default, statistics are collected on the first 32 columns (for Unity Catalog managed tables, statistics collection is handled intelligently by predictive optimization without this 32-column limit).

State reconstruction: how Delta reads a table

When you query a Delta table, Spark doesn't just list the Parquet files in the directory. It reconstructs the table state through a process called log replay: starting from the latest checkpoint (or the beginning of the log), it sequentially applies each commit's actions — adding files from add actions, removing files from remove actions, applying metadata changes. The result is a consistent snapshot at a specific version: the exact set of files to read and the schema that applies.

This same mechanism powers time travel. To see the table at version N, the reader replays the log up to commit N and stops.

Checkpoints speed up replay

Replaying thousands of JSON files to reconstruct table state would be slow. Periodically, Delta creates a checkpoint — a snapshot of the complete table state in Parquet format. A checkpoint consolidates all actions up to that version, with canceled-out actions removed (e.g., a file that was added and then removed is omitted). Databricks optimizes checkpoint frequency automatically based on data size and workload.

Instead of replaying the entire log from commit 0, a reader locates the latest checkpoint, loads it, and only replays the JSON files written after it. The _last_checkpoint file points to the most recent one.

THE IMMUTABILITY PRINCIPLE

Delta Lake never modifies existing Parquet files. Ever. Every write operation -- INSERT, UPDATE, DELETE, MERGE -- creates new files. Old files are marked as "removed" in the transaction log but remain in storage.

This design enables:

  • Atomicity: All transactions either succeed or fail completely. A write either produces a full commit entry with all its files, or it doesn't, no partial states.
  • Consistency: Delta uses optimistic concurrency control to validate changes against concurrent commits, ensuring the table transitions only between valid states.
  • Isolation: Write-serializable isolation for writes; snapshot isolation for reads. Readers always see a consistent snapshot as of the version when their query started.
  • Durability: Once committed, data is permanent. Files live on cloud object storage alongside transaction log entries, inheriting the storage system's durability guarantees.
  • Time Travel: Old table versions remain accessible. You can query any previous version of the table until VACUUM deletes the old files.

But immutability comes with a cost.

COPY-ON-WRITE: THE COST OF UPDATES

When you update rows in Delta Lake, here's what actually happens:

  1. Spark reads the transaction log to find files that might contain matching rows (using statistics)
  2. Spark reads those entire files into memory
  3. Spark applies the update to matching rows
  4. Spark writes ALL rows (both updated AND unmodified) to NEW files
  5. The transaction log records: "remove" old files, "add" new files

This is Copy-on-Write (CoW). The critical insight: even updating 1 row in a 1GB file means reading 1GB and writing 1GB. The entire file is rewritten. This is write amplification -- a small logical change causes a large physical write.

And it's not just single-file impact. If an UPDATE matches rows across 5 files, all 5 files are fully rewritten -- even if only a handful of rows actually changed in each.

Example:

  • File contains 1,000,000 rows
  • UPDATE changes 5 rows
  • Result: Read 1,000,000 rows, write 1,000,000 rows
  • Old file gets logically deleted (marked removed, still on storage)
  • New file contains all 1,000,000 rows with 5 updated

For tables with frequent small updates, this gets expensive fast.

Merge-on-Read (Deletion Vectors the lighter alternative)

The opposite of Copy-on-Write is Merge-on-Read (MoR): instead of rewriting files at write time, modifications are recorded as lightweight metadata and merged at read time. In Delta Lake, this strategy is implemented through Deletion Vectors.

When deletion vectors are enabled, DELETE, UPDATE, and MERGE operations mark affected rows as modified without rewriting the Parquet file. A compact bitmap (the deletion vector) records which row positions are invalidated. For updates, new versions of those rows go in a separate small file.

How it works:

  • Original data file stays untouched
  • A small deletion vector file records which row positions are invalidated
  • For updates, the old row is marked as deleted via the Deletion Vector and the new version of the row is written to a separate file
  • Readers apply the deletion vector at query time to filter out invalidated rows

Trade-off: Much faster writes (no file rewrites), slightly slower reads (must check deletion vectors), eventual compaction needed (OPTIMIZE merges everything).

Enable with:

ALTER TABLE my_table SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'true');

Databricks recommends DBR 14.3 LTS and above for full optimization support with deletion vectors.

HOW MERGE WORKS INTERNALLY

MERGE is Delta Lake's most complex operation. Understanding its phases explains why it can be slow:

Phase 1 - Find Affected Files: Spark reads the transaction log and extracts file statistics. It evaluates your merge condition against those statistics to identify files that MIGHT contain matching rows. This is where good statistics and smart merge conditions pay off. If your condition includes a filter on a column with good statistics, Spark can skip most files.

Phase 2 - Apply Changes: Spark reads the identified files, joins with your source data, and applies the whenMatched/whenNotMatched logic. Then it writes new files containing the results.

Phase 3 - Commit: The transaction log gets "remove" actions for old files and "add" actions for new files with fresh statistics.

The optimization opportunity: Phase 1 determines how many files are scanned. The key is giving Spark enough information in your merge condition to skip files via statistics (min/max per column stored in the transaction log).

On a table with Liquid Clustering, data is already physically organized by the clustering columns, producing tight, non-overlapping min/max ranges per file. This means filters on clustering columns in your merge condition translate directly into aggressive file skipping -- Spark can eliminate most files from the scan without any manual tuning.

-- Table clustered by (event_time, event_id)
-- Weak: event_id alone may span many files

MERGE INTO target USING source
ON target.event_id = source.event_id

-- Better: adding a filter on a clustering column lets Spark
-- use tight file-level statistics to skip non-matching files

MERGE INTO target USING source
ON target.event_time >= '2024-01-15'
  AND target.event_id = source.event_id

The same principle applies to any table with good file-level statistics, whether achieved through Liquid Clustering, Z-ordering, or naturally ordered data. The mechanism is the same: Spark checks the min/max statistics stored in the transaction log and skips files whose value ranges don't overlap with your filter. The advantage of Liquid Clustering is that it produces these tight ranges automatically and incrementally, without requiring you to manage partitioning schemes or schedule periodic Z-order rewrites.

Check your merge performance with DESCRIBE HISTORY my_table. Look at numTargetFilesRemoved vs numTargetFilesAdded in the operation metrics. The scanTimeMs and rewriteTimeMs metrics tell you how long was spent finding matching files vs. rewriting them. If scan time dominates, your merge condition isn't selective enough.

OPTIMIZE: COMPACTING FILES FOR BETTER PERFORMANCE

Streaming jobs and frequent appends create many small files. Why is this bad?

  • Metadata overhead: Each file = one entry in the transaction log. Thousands of files = slow log parsing.
  • Task overhead: Spark creates one task per file. Thousands of files = thousands of tasks = scheduling overhead.
  • Poor compression: Small files compress less efficiently than large files.
  • Worse data skipping: Statistics are per-file. More files with overlapping ranges = more files to scan.

OPTIMIZE compacts small files. What happens: Delta identifies files smaller than the target size, reads and combines them into larger files, writes new compacted files, and the transaction log records "remove" small files, "add" compacted files. The operation has dataChange = false -- the logical data is identical; only physical layout changes.

What's the target file size? Databricks auto-tunes this based on table size:

Table size

Auto-tuned target file size

< 2.56 TB

256 MB

2.56 TB - 10 TB

256 MB - 1 GB (scales linearly)

> 10 TB

1 GB

You can override this with a fixed target:

ALTER TABLE my_table SET TBLPROPERTIES ('delta.targetFileSize' = '512mb');
  • Larger target size (e.g., 1GB) -- better for large batch reads, fewer files to scan
  • Smaller target size (e.g., 128MB-512MB) -- better for tables with frequent updates/merges, since smaller files mean less data rewritten per Copy-on-Write operation

The trade-off is read performance (fewer large files) vs. write amplification (smaller files = less rewrite cost on updates). Choose based on your table's read/write pattern.

Do you need to run OPTIMIZE manually? It depends:

  • Optimized writes Shuffles data before writing to produce fewer, better-sized files. Always on for MERGE/UPDATE/DELETE. Adds write latency due to shuffle. Not a replacement for OPTIMIZE.
  • Auto compaction runs after writes and handles basic small-file consolidation, but it is not a full replacement for OPTIMIZE. It does not perform Z-ordering or liquid clustering. For tables larger than 1 TB, Databricks recommends still running OPTIMIZE on a schedule.
  • Predictive optimization (for Unity Catalog managed tables) automatically runs both OPTIMIZE and VACUUM when cost-effective. If you're using predictive optimization, you generally don't need scheduled OPTIMIZE jobs.
  • Without either feature, schedule OPTIMIZE to run regularly -- daily is a good starting point.

Scoping OPTIMIZE with WHERE: You don't have to optimize the entire table. Use a WHERE clause to target specific clustering/partition keys:

OPTIMIZE my_table WHERE date >= current_timestamp() - INTERVAL 1 day

This is especially useful for large tables where only recent partitions receive new data.

LIQUID CLUSTERING: THE MODERN APPROACH

Liquid Clustering is the recommended approach for all new tables. It replaces both traditional partitioning and Z-ordering with a simpler, more efficient mechanism. GA support is available on DBR 15.2 and above.

How it works under the hood

Liquid Clustering uses multi-dimensional data locality. The key innovation is that it's stateful -- Delta tracks clustering information in the transaction log, enabling:

  • Incremental clustering: New data is clustered without touching existing data.
  • Flexible column changes: Redefine clustering columns without a full table rewrite.
  • Automatic balancing: Handles data skew and cardinality issues automatically.
  • Clustering on write: For operations like INSERT INTO, CTAS, and COPY INTO, clustering can happen at write time (subject to size thresholds), reducing the need for separate OPTIMIZE runs.
-- Create table with liquid clustering
CREATE TABLE events (
    event_id BIGINT,
    event_time TIMESTAMP,
    user_id BIGINT,
    event_data STRING
)
USING DELTA
CLUSTER BY (event_time, event_id);

-- Trigger incremental clustering (only new/modified data)
OPTIMIZE events;

-- Change clustering columns without rewriting data
ALTER TABLE events CLUSTER BY (user_id, event_time);

Key constraints and limits

  • You can specify up to four clustering keys.
  • Clustering keys must be columns that have statistics collected.
  • Clustering is not compatible with partitioning or ZORDER. You use one or the other.

VACUUM: RECLAIMING STORAGE

Remember all those "removed" files from updates, deletes, and OPTIMIZE? They're still in storage. Why keep them? Time travel (SELECT * FROM my_table VERSION AS OF 5 or SELECT * FROM my_table TIMESTAMP AS OF '2024-01-15') and concurrent readers (a query started before the write might still be reading old files).

VACUUM physically deletes them:

VACUUM my_table RETAIN 168 HOURS  -- 7 days (the default retention period)

What VACUUM does:

  • Reads transaction log to find "remove" actions older than the retention period.
  • Physically deletes those files from storage.
  • Records audit information in the transaction log (commitInfo entries).

What VACUUM does NOT do:

  • Does not improve query performance (that's OPTIMIZE).
  • Does not compact files.
  • Does not add or remove file-level actions in the transaction log.

After VACUUM: Time travel to versions older than the retention period will fail. Those files are gone.

Safety check: There's a built-in safety check that prevents running VACUUM with a dangerously low retention threshold. The reason is long-running jobs write data files before committing them to the transaction log. If VACUUM deletes those uncommitted files before the job completes, the transaction log will reference files that no longer exist — leading to table corruption and failed reads.

 If you're certain there are no long-running operations, you can override it by setting spark.databricks.delta.retentionDurationCheck.enabled to false. But think twice before doing this.

The typical workflow:

-- Compact files and improve layout
OPTIMIZE my_table ZORDER BY (event_id);

-- After retention period, clean up old versions
VACUUM my_table RETAIN 168 HOURS;

For Unity Catalog managed tables, predictive optimization can run both OPTIMIZE and VACUUM automatically. If enabled, you generally don't need to schedule these yourself.

CONCURRENCY: WHAT HAPPENS WITH SIMULTANEOUS WRITES

Delta Lake uses optimistic concurrency control — multiple writers proceed without acquiring locks, and conflicts are detected at commit time. Each writer goes through three stages:

  1. Read — identify which files need to be modified
  2. Write — stage new data files
  3. Validate and commit — check for conflicts with concurrently committed changes; if none, commit as a new version

Isolation levels: Databricks uses write-serializable isolation by default for table writes and updates, and snapshot isolation for reads. Write serializability ensures that the result of concurrent writes is equivalent to some serial ordering. Snapshot isolation means readers always see a consistent snapshot of the table as of the version when their query started, regardless of concurrent writes. You can configure the isolation level per table via the isolationLevel table property (WriteSerializable or Serializable).

Conflict detection is more nuanced than just checking whether files were modified. Delta also verifies whether new files have been added by concurrent transactions in partitions or ranges that your operation reads. For example, if your DELETE reads a partition and a concurrent INSERT adds new files to that same partition, Delta detects this as a conflict because your operation might have missed those new rows. Liquid clustering supports row level concurrency, and is way better than partitioning.

When a conflict is detected, the writer receives one of several exception types depending on the nature of the conflict:

  • ConcurrentAppendException - a concurrent operation added files in a partition that your operation read
  • ConcurrentDeleteReadException - a concurrent operation deleted a file that your operation read
  • ConcurrentDeleteDeleteException - a concurrent operation deleted a file that your operation also deletes
  • MetadataChangedException - a concurrent operation updated the table metadata
  • ProtocolChangedException - a concurrent operation upgraded the table protocol

KEY TAKEAWAYS

Parquet files store column statistics (min/max/null_count) in the footer. Delta maintains these statistics at a second level in the transaction log so Spark can skip files without opening them.

Delta never modifies files. Updates use Copy-on-Write -- entire files are rewritten even for small changes. This is write amplification. Deletion vectors avoid this by marking rows as invalidated without rewriting files. For MERGE-heavy tables, set tuneFileSizesForRewrites to reduce rewrite cost.

MERGE performance relies heavily on the file-identification phase. Add filters to your merge condition to enable file pruning and data skipping. Check `scanTimeMs` and `rewriteTimeMs` in `DESCRIBE HISTORY` to diagnose bottlenecks.

OPTIMIZE compacts small files. The default target file size is auto-tuned (256MB for most tables, scaling to 1GB for tables over 10TB). Auto compaction helps but is not a full replacement for OPTIMIZE on large tables.

Liquid Clustering is the recommended approach for all new tables. It replaces both partitioning and Z-ordering, is incremental, and allows changing clustering columns without rewriting data. Z-ordering remains relevant for existing partitioned tables on older runtimes.

VACUUM physically deletes old file versions after the retention period (7 days by default). Run it to reclaim storage, but understand you lose time travel to deleted versions. For large tables, consider VACUUM LITE mode.