Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Clean up _delta_log files

elgeo
Valued Contributor II

Hello experts. We are trying to clarify how to clean up the large number of files accumulating in the _delta_log folder (json, crc and checkpoint files). We went through the related posts in the forum and followed the steps below:

SET spark.databricks.delta.retentionDurationCheck.enabled = false;

ALTER TABLE table_name

SET TBLPROPERTIES ('delta.logRetentionDuration'='interval 1 minutes', 'delta.deletedFileRetentionDuration'='interval 1 minutes');

VACUUM table_name RETAIN 0 HOURS

We understand that each time a checkpoint is written, Databricks automatically cleans up log entries older than the specified retention interval. However, after new checkpoints and commits, all the log files are still there.

Could you please help? Just to mention that these are tables where we don't need any time travel.
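
For reference, this is roughly how we are checking what is sitting in the log folder (the path below is just a placeholder for our actual table location):

# Placeholder path - replace with the table's actual storage location.
log_path = "dbfs:/path/to/table/_delta_log"

files = dbutils.fs.ls(log_path)  # dbutils is available in Databricks notebooks
print("json:      ", sum(1 for f in files if f.name.endswith(".json")))
print("crc:       ", sum(1 for f in files if f.name.endswith(".crc")))
print("checkpoint:", sum(1 for f in files if "checkpoint" in f.name))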

9 REPLIES

Brad
Contributor II

Hi, has this been fixed since then? We have seen similar issues. Thanks.

filipniziol
Esteemed Contributor

Hi @Brad , @elgeo ,

1. Regarding VACUUM: per the documentation, it does not remove log files.
[screenshot of the documentation stating that VACUUM does not delete log files]

2. Setting delta.logRetentionDuration to 1 minute is way too low and may not work.

The default is 30 days and there is a safety check that prevents setting it below 7 days. More on this in this topic.
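
If it helps, a quick way to see which retention-related properties have actually been set on the table (the table name is just an example; properties only show up here once they have been explicitly set, otherwise the defaults apply):

# List retention-related properties explicitly set on the table (example name).
for row in spark.sql("SHOW TBLPROPERTIES table_name").collect():
    if "retention" in row.key.lower():
        print(row.key, "=", row.value)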

radothede
Valued Contributor II

That's right, just a small note: the default threshold for the retention period is 7 days.

filipniziol
Esteemed Contributor

Hi @radothede ,

The default for delta.logRetentionDuration is 30 days as per documentation:

[screenshot of the documentation showing the 30-day default for delta.logRetentionDuration]

radothede
Valued Contributor II

@filipniziol 

We are both right, but to be specific, I was referring to the VACUUM command. So effectively, if you run it on a table, by default VACUUM will delete from storage the data files older than 7 days that are no longer referenced by the delta table's transaction log.

So, to make it clear:

delta.deletedFileRetentionDuration - default 7 days; data files older than this retention period are deleted, but only when the VACUUM command runs;

delta.logRetentionDuration - default 30 days; log entries older than this retention period are removed when a new checkpoint is written - a built-in mechanism that does not need VACUUM.
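
As a quick illustration that VACUUM only ever targets data files, you can do a dry run first (table name is just an example); the output lists the data files that would be removed, and _delta_log files never appear there:

# Dry run: lists the data files VACUUM would delete, without deleting anything.
# Log files under _delta_log are never part of this list.
spark.sql("VACUUM table_name DRY RUN").show(truncate=False)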

Brad
Contributor II

Awesome, thanks for the response.

michaeljac1986
New Contributor II

What you're seeing is expected behavior: the _delta_log folder always keeps a history of JSON commit files, checkpoint files, and CRCs. Even if you lower delta.logRetentionDuration and run VACUUM, cleanup won't happen immediately. A couple of points to note:

  • The property delta.logRetentionDuration controls how long log history is kept for time travel, but actual cleanup only happens when a new checkpoint is written and retention thresholds are met.

  • Setting it to something like 1 minute will disable time travel almost immediately, but you still need to wait for the next compaction/checkpoint cycle to actually drop files.

  • VACUUM only removes data files, not log files, so it won't reduce _delta_log size on its own.

If you really don't need any history/time travel, the supported approach is to:

  1. Set spark.databricks.delta.retentionDurationCheck.enabled = false.

  2. Use a very small delta.logRetentionDuration (like interval 1 minute).

  3. Trigger a few commits (inserts/updates) so new checkpoints are written.

  4. Delta will then automatically prune older JSON and CRC files beyond the retention window.

Also note that the _delta_log folder will never be completely empty; at least the most recent checkpoint plus a few commit files are always retained.
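
Here is a rough sketch of steps 1 and 2 above in a notebook, assuming the table name is a placeholder and that you are sure time travel is not needed:

# Step 1: disable the retention safety check for this session only.
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

# Step 2: shrink the log retention window (placeholder table name;
# only do this if time travel is genuinely not needed).
spark.sql("""
    ALTER TABLE table_name
    SET TBLPROPERTIES ('delta.logRetentionDuration' = 'interval 1 minutes')
""")

# Steps 3-4: then run a few normal commits; once the next checkpoint is written,
# log entries older than the retention window are pruned automatically.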


iyashk-DB
Databricks Employee

Delta Lake does automatically clean up _delta_log files (JSON, CHECKPOINT, CRC), but only when two conditions are met:

  1. The retention durations are respected
    By default:

    • delta.logRetentionDuration = 30 days

    • delta.deletedFileRetentionDuration = 7 days

    • spark.databricks.delta.retentionDurationCheck.enabled = true (safety check)

  2. A new checkpoint is created after the retention window has passed
    Cleanup only happens when a new checkpoint is written, not immediately when properties are changed.


✔️ Why files are not being deleted in your case

Even though you set:

SET spark.databricks.delta.retentionDurationCheck.enabled = false;

ALTER TABLE table_name
SET TBLPROPERTIES (
  'delta.logRetentionDuration'='interval 1 minutes',
  'delta.deletedFileRetentionDuration'='interval 1 minutes'
);

VACUUM table_name RETAIN 0 HOURS;

Delta still won't delete older log files unless:

  • The retention interval has actually passed in wall-clock time

  • A new checkpoint is written after the retention window

  • The table has enough new commits to trigger a checkpoint (usually every 10 commits)

Simply setting the retention to 1 minute does not retroactively delete anything. Delta only evaluates retention at checkpoint-creation time.
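
If you want to see where the table currently stands, the _delta_log folder contains a small _last_checkpoint file that records the version of the most recent checkpoint (it only exists once at least one checkpoint has been written). A quick way to compare it with the latest commit, using a placeholder path and table name:

import json

log_path = "dbfs:/path/to/table/_delta_log"  # placeholder path

# Version of the most recent checkpoint (file exists only after the first checkpoint).
last_checkpoint = json.loads(dbutils.fs.head(f"{log_path}/_last_checkpoint"))
print("last checkpoint version:", last_checkpoint["version"])

# Version of the most recent commit, from the table history.
latest_commit = spark.sql("DESCRIBE HISTORY table_name LIMIT 1").collect()[0]["version"]
print("latest commit version:  ", latest_commit)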

1. VACUUM does not delete JSON / CHECKPOINT log files

VACUUM only removes data files that are no longer referenced.
It never touches the transaction log.

This is why your _delta_log folder still looks large.

2. _delta_log cleanup only happens during checkpoint creation

If you are not generating new transactions, no cleanup will happen.
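
A simple way to confirm whether the table is actually receiving new commits is to look at its recent history (table name is just an example):

# Show the 20 most recent commits; if nothing new appears here,
# no new checkpoints (and therefore no log cleanup) will happen.
history = spark.sql("DESCRIBE HISTORY table_name")
history.select("version", "timestamp", "operation").show(20, truncate=False)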

3. Very low retention settings (like 1 minute) are not recommended

They can cause checkpoint conflicts and metadata corruption during concurrent writes.

You can force a cleanup safely

  1. Make sure the retention check is disabled on the cluster that writes the table

SET spark.databricks.delta.retentionDurationCheck.enabled = false;
  2. Set a realistic, low but safe retention, e.g.:

ALTER TABLE table_name
SET TBLPROPERTIES (
  'delta.logRetentionDuration'='interval 1 day',
  'delta.deletedFileRetentionDuration'='interval 1 day'
);
  3. Generate a few commits to trigger a new checkpoint:

df = spark.table("table_name").limit(1)
df.write.mode("append").format("delta").saveAsTable("table_name")

Repeat this around 10 times to force a checkpoint (a small loop version of this append is sketched after step 4).

  4. After the new checkpoint, older log files (beyond retention) will be removed automatically.
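
A minimal loop version of step 3, using the same append pattern as above (table_name is a placeholder):

# Generate ~10 small commits so a new checkpoint gets written.
for i in range(10):
    (spark.table("table_name").limit(1)
         .write.format("delta").mode("append")
         .saveAsTable("table_name"))
    print(f"commit {i + 1} done")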


In Summary:

  • _delta_log files are not deleted by VACUUM

  • They are only deleted during checkpoint creation

  • Changing retention properties does not delete old logs immediately

  • You must generate commits and allow a checkpoint to be created

  • Only then will Delta remove logs older than the retention window