Databricks Community

PradeepNarvekar · Friday

Let Me Guess What Happened to You
You built a solid data pipeline. It runs every day, ingests a few gigabytes, everything looks fine. Then one morning you open your cloud storage bill — AWS S3, Azure ADLS, or Google Cloud Storage — and something feels very wrong.
Your table receives maybe 2–3 GB of new data per day. But the total storage sitting on disk is 200 GB. Or 400 GB. Or more.
You didn't make a mistake in your pipeline. Your data isn't duplicated. Everything is "working."
And yet your storage keeps growing. Every. Single. Day.
This is one of the most common surprises people run into with Delta Lake, and the good news is — it has a very simple explanation and a very simple fix. Let me walk you through both.

First, Understand Why Delta Lake Keeps Old Data
Delta Lake is built on a concept called ACID transactions. This means every write to your table — whether it's an INSERT, UPDATE, DELETE, or MERGE — is tracked and logged so that you always have a consistent, reliable view of your data.
To make this work, Delta Lake uses something called the transaction log (stored in the _delta_log folder inside your table directory). Every time you change data, Delta Lake does not go back and modify existing files. Instead, it:

Writes new files with the updated or new data
Marks the old files as deleted in the transaction log
Keeps the old files sitting on disk — just in case

That last point is the key. Delta Lake keeps the old, "deleted" files on disk because of a very useful feature called Time Travel.

What Is Time Travel and Why Does It Cause Storage to Grow?
Time Travel lets you query older versions of your table. For example:
%sql-- See what your table looked like 3 days ago
SELECT * FROM my_table TIMESTAMP AS OF '2026-05-26';

-- Or go back to a specific version number
SELECT * FROM my_table VERSION AS OF 10;
This is genuinely useful. You can recover from accidental deletes, audit historical data, compare results across time, or roll back a bad pipeline run.
But to make Time Travel possible, Delta Lake has to keep all the old files around. It cannot delete them, because if it did, you wouldn't be able to go back in time.
So here's the reality of what happens over time:

Day 1: You load 2 GB of data → 2 GB on disk
Day 2: You update some records → Delta writes new files, keeps old files → now maybe 4–5 GB on disk
Day 7: You run a MERGE → more new files, old files still there → 15–20 GB on disk
Day 30: Files have been piling up for a month → you're at 80–100 GB even though your actual data is still just a few GB

Nobody warned you. Delta Lake was doing exactly what it was designed to do. But now you're paying for it.

The Fix: Two Commands You Need to Know
Command 1: VACUUM — The File Cleaner
VACUUM physically deletes old data files from your storage that are no longer needed for the current version of your table.
%sql
VACUUM your_table_name RETAIN 168 HOURS;
Let's break this down:

VACUUM — tells Delta Lake to scan the table directory and find files marked as deleted
your_table_name — the name of your Delta table
RETAIN 168 HOURS — keep files that are less than 168 hours (7 days) old, delete everything older

The 168-hour (7-day) retention window is the default, and it's there for a reason. If you or your team ever uses Time Travel to look back at old data, you need those old files to still exist. 7 days is a safe buffer for most teams.
What happens if you run VACUUM without RETAIN?
Delta Lake will refuse to run it unless you explicitly turn off a safety check. That's by design. Deleting files with too short a retention window can break Time Travel queries mid-execution. Always be intentional about how far back you need to go.
In PySpark, you can also run it like this:
%python
from delta.tables import DeltaTable

delta_table = DeltaTable.forName(spark, "your_table_name")
delta_table.vacuum(retentionHours=168)
What to expect: The first time you run VACUUM on a table that's never been cleaned, you might see storage drop dramatically — from hundreds of GB down to just what you actually need. One person in the community reported going from 400 GB down to 35 GB on the first run.

Command 2: OPTIMIZE — The File Consolidator
Even after VACUUM cleans up old files, there's a second problem: small file fragmentation.
Think about what happens when you have a streaming pipeline writing data every 5 minutes. Each micro-batch writes a tiny Parquet file — maybe 1 MB or 5 MB. After a week, you might have thousands of tiny files sitting in your table directory.
When a query runs against a fragmented table, Spark has to open and read each of those tiny files individually. Opening files has overhead. Reading thousands of small files is much slower than reading a handful of large ones.
OPTIMIZE fixes this by merging small files into larger, well-sized files:
%sql
OPTIMIZE your_table_name;
That's the basic version. It will compact small files into files around 1 GB each, which is the sweet spot for Spark to work efficiently.
Want to go a step further? Add Z-ORDER:
%sql
OPTIMIZE your_table_name ZORDER BY (date, region);
Z-ORDER physically reorganizes the data within the files so that rows with similar values for date and region are stored close together. This means when a query filters on those columns, Spark can skip large portions of the data entirely instead of scanning everything.
Use Z-ORDER on the columns you filter on most in your queries — typically dates, IDs, regions, or status fields.

The Right Order to Run These Commands
Always run OPTIMIZE before VACUUM. Here's why:
OPTIMIZE creates new, larger files and marks the old small files as deleted. VACUUM then cleans up those deleted small files. If you run VACUUM first and then OPTIMIZE, VACUUM won't have anything extra to clean up yet.
%sql-- Step 1: Compact and reorganize files
OPTIMIZE your_table_name ZORDER BY (your_filter_column);

-- Step 2: Clean up the old files that OPTIMIZE marked as deleted
VACUUM your_table_name RETAIN 168 HOURS;

How Often Should You Run These?
This depends on how active your table is, but here's a general rule of thumb:
Table ActivityOPTIMIZE FrequencyVACUUM FrequencyHeavy writes (streaming, hourly jobs)DailyWeeklyModerate writes (daily batch)WeeklyWeeklyLight writes (occasional loads)MonthlyMonthly
A simple way to handle this is to create a Databricks Workflow job that runs these two commands on a schedule. Set it up once, forget about it, and let it run in the background.

What About Predictive Optimization?
If you're on Databricks with Unity Catalog managed tables, there's an even easier option: Predictive Optimization.
When enabled, Databricks automatically decides when to run OPTIMIZE and VACUUM on your tables based on usage patterns. You don't have to schedule anything manually.
You can enable it at the catalog or schema level:
%sql
ALTER CATALOG my_catalog
SET DBPROPERTIES ('delta.predictiveOptimization.enabled' = 'true');
This is the "set it and forget it" version of table maintenance. For teams managing many tables, it's genuinely useful — it removes the operational burden of scheduling maintenance jobs for every table.

A Few Things to Watch Out For
Don't set retention too low. If your team uses Time Travel regularly, make sure your VACUUM retention window is longer than how far back anyone is likely to query. 7 days is a good minimum, but some teams use 30 days for compliance reasons.
VACUUM is not instant on large tables. It needs to scan every file in the table directory. On very large tables with lots of history, it can take a while. Run it during off-peak hours.
Z-ORDER on the wrong columns wastes time. Only Z-ORDER on columns you actually filter on frequently. Adding too many columns to Z-ORDER slows down the OPTIMIZE operation without helping queries.
External tables need careful handling. VACUUM on external tables deletes files from your external storage location. Make sure you understand what's there before running it for the first time.

Quick Summary
ProblemCauseFixStorage keeps growingDelta keeps old file versions for Time TravelVACUUM to delete old filesQueries are slowToo many small files from frequent writesOPTIMIZE to compact filesQueries scan too much dataData is not organized by filter columnsOPTIMIZE with Z-ORDERDon't want to maintain manuallyManaging many tables is tediousEnable Predictive Optimization

The Bottom Line
Delta Lake's file retention behavior is not a bug — it's what makes Time Travel, rollbacks, and ACID guarantees possible. But if you never clean up, you'll pay for storage you don't need and suffer queries that are slower than they should be.
Two commands. Run them on a schedule. Your storage bill will thank you.
%sql
OPTIMIZE your_table ZORDER BY (your_most_filtered_column);
VACUUM your_table RETAIN 168 HOURS;
If you're managing more than a handful of tables, look into Predictive Optimization in Unity Catalog — it handles all of this automatically so you can focus on building pipelines instead of babysitting tables.

Hope this helps someone out there who opened their cloud bill this morning and had a small heart attack. Drop any questions in the comments — happy to help troubleshoot specific scenarios.

Databricks Community

Your Delta Lake Table Is Secretly Ballooning — Here's the 2-Command Fix

DAIS 2026 Speaker Spotlight Series #11 | Krishan Kumar

FLASH SALE: Save 50% on Summit Training ⚡

🌟 Community Pulse: Your Weekly Roundup! May 18 – 24, 2026

Community BrickTalk: Using AI to Navigate Unfamiliar Business Data

Solution Accelerator Series | Survival Analysis for Churn and Lifetime Value