10-31-2022 05:46 AM
Hello experts. We are trying to clarify how to clean up the large number of files accumulating in the _delta_log folder (json, crc and checkpoint files). We went through the related posts in the forum and followed the below:
SET spark.databricks.delta.retentionDurationCheck.enabled = false;
ALTER TABLE table_name
SET TBLPROPERTIES ('delta.logRetentionDuration'='interval 1 minutes', 'delta.deletedFileRetentionDuration'='interval 1 minutes');
VACUUM table_name RETAIN 0 HOURS
We understand that each time a checkpoint is written, Databricks automatically cleans up log entries older than the specified retention interval. However, after new checkpoints and commits, all the log files are still there.
Could you please help? Just to mention, this concerns tables where we don't need any time travel.
10-11-2024 02:47 PM
Hi, has this been fixed since then? We have seen similar issues. Thanks.
10-13-2024 03:42 AM
Hi @Brad, @elgeo,
1. Regarding VACUUM: per the documentation, it does not remove log files.
2. Setting delta.logRetentionDuration to 1 minute is far too low and may not work.
The default is 30 days, and there is a safety check that prevents setting it below 7 days. More on this in this topic
10-13-2024 11:56 AM
That's right, just a small note: the default threshold for the retention period is 7 days.
10-13-2024 11:46 PM - edited 10-13-2024 11:51 PM
We are both right, but to be specific, I was referring to the VACUUM command. So effectively, if you run it on a table, by default the VACUUM command will delete data files from storage that are older than 7 days and no longer referenced by the Delta table's transaction log.
So, to make it clear:
delta.deletedFileRetentionDuration - default 7 days, removes data files older than the specified retention period - triggered by the VACUUM command;
delta.logRetentionDuration - default 30 days, removes log entries older than the retention period when a new checkpoint is written - a built-in mechanism that does not need VACUUM.
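If it helps, a quick way to confirm what a given table is actually configured with (a small sketch, using the same table_name placeholder as above; properties that were never set simply won't appear, meaning the defaults apply):
# Show explicitly set Delta properties; anything missing falls back to the defaults
spark.sql("SHOW TBLPROPERTIES table_name").show(truncate=False)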
10-13-2024 04:39 PM
Awesome, thanks for the response.
09-02-2025 10:19 PM
What you're seeing is expected behavior: the _delta_log folder always keeps a history of JSON commit files, checkpoint files, and CRCs. Even if you lower delta.logRetentionDuration and run VACUUM, cleanup won't happen immediately. A couple of points to note:
The property delta.logRetentionDuration controls how long log history is kept for time travel, but actual cleanup only happens when a new checkpoint is written and retention thresholds are met.
Setting it to something like 1 minute will disable time travel almost immediately, but you still need to wait for the next compaction/checkpoint cycle to actually drop files.
VACUUM only removes data files, not log files, so it won't reduce _delta_log size on its own.
If you really don't need any history/time travel, the supported approach is to:
Set spark.databricks.delta.retentionDurationCheck.enabled = false.
Use a very small delta.logRetentionDuration (like interval 1 minute).
Trigger a few commits (inserts/updates) so new checkpoints are written.
Delta will then automatically prune older JSON and CRC files beyond the retention window.
Also note that the _delta_log folder will never be completely empty; at least the most recent checkpoint plus a few recent commit files are always retained.
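Putting those steps together, a rough sketch of the flow (table_name is just the placeholder used throughout this thread; the exact moment the log shrinks still depends on when Delta writes the next checkpoint):
# Disable the safety check in the session that writes the table
spark.sql("SET spark.databricks.delta.retentionDurationCheck.enabled = false")
# Shrink the log retention window (this effectively gives up time travel)
spark.sql("ALTER TABLE table_name SET TBLPROPERTIES ('delta.logRetentionDuration'='interval 1 minutes')")
# Generate commits so a new checkpoint gets written; log cleanup is evaluated at checkpoint time
for _ in range(10):
    spark.sql("INSERT INTO table_name SELECT * FROM table_name LIMIT 1")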
Wednesday
Delta Lake does automatically clean up _delta_log files (JSON, CHECKPOINT, CRC), but only when two conditions are met:
1. The retention durations are respected
By default:
delta.logRetentionDuration = 30 days
delta.deletedFileRetentionDuration = 7 days
spark.databricks.delta.retentionDurationCheck.enabled = true (safety check)
2. A new checkpoint is created after the retention window has passed
Cleanup only happens when a new checkpoint is written, not immediately when properties are changed.
Even though you set:
SET spark.databricks.delta.retentionDurationCheck.enabled = false;
ALTER TABLE table_name SET TBLPROPERTIES (
  'delta.logRetentionDuration'='interval 1 minutes',
  'delta.deletedFileRetentionDuration'='interval 1 minutes'
);
VACUUM table_name RETAIN 0 HOURS;
Delta still won't delete older log files unless:
The retention interval has actually passed in wall-clock time
A new checkpoint is written after the retention window
The table has enough new commits to trigger a checkpoint (usually every 10 commits)
Simply setting the retention to 1 minute does not retroactively delete anything. Delta only evaluates retention at checkpoint-creation time.
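One way to see when the most recent checkpoint was written is to read the _last_checkpoint pointer in the log directory (sketch; assumes a Databricks notebook where dbutils is available, with the same table_name placeholder):
# Resolve the table's storage location, then read the _last_checkpoint pointer
# (the file only exists once at least one checkpoint has been written)
location = spark.sql("DESCRIBE DETAIL table_name").collect()[0]["location"]
print(dbutils.fs.head(location + "/_delta_log/_last_checkpoint"))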
VACUUM only removes data files that are no longer referenced.
It never touches the transaction log.
This is why your _delta_log folder still looks large.
If you are not generating new transactions, no cleanup will happen.
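To see what VACUUM would actually remove without deleting anything, a dry run can help (sketch, again using the table_name placeholder):
# Lists the data files VACUUM would delete; _delta_log contents are never part of this list
spark.sql("VACUUM table_name DRY RUN").show(truncate=False)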
Do not delete files from _delta_log manually: that can cause checkpoint conflicts and metadata corruption during concurrent writes.
Make sure the retention check is disabled on the cluster that writes the table:
SET spark.databricks.delta.retentionDurationCheck.enabled = false;
Set a realistic, low but safe retention, e.g.:
ALTER TABLE table_name SET TBLPROPERTIES (
  'delta.logRetentionDuration'='interval 1 day',
  'delta.deletedFileRetentionDuration'='interval 1 day'
);
Generate a few commits to trigger a new checkpoint:
# Append a single row repeatedly to generate commits
for _ in range(10):
    df = spark.table("table_name").limit(1)
    df.write.mode("append").format("delta").saveAsTable("table_name")
Ten commits are usually enough for Delta to write a new checkpoint.
After the new checkpoint, older log files (beyond retention) will be removed automatically.
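If you want to confirm the cleanup actually happened, you can list the log directory before and after (sketch; assumes a Databricks notebook with dbutils and the same table_name placeholder):
# List the remaining JSON, CRC and checkpoint files in _delta_log
location = spark.sql("DESCRIBE DETAIL table_name").collect()[0]["location"]
for f in dbutils.fs.ls(location + "/_delta_log"):
    print(f.name, f.size)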
_delta_log files are not deleted by VACUUM
They are only deleted during checkpoint creation
Changing retention properties does not delete old logs immediately
You must generate commits and allow a checkpoint to be created
Only then will Delta remove logs older than the retention window