Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Delta Log Files in GCS Not Deleting Automatically Despite Configuration

minhhung0507
New Contributor

Hello Databricks Community,

I am experiencing an issue with Delta Lake where the _delta_log files are not being deleted automatically from my GCS bucket, even though I have set the table properties to enable this behavior. Here is the configuration I used:

ALTER TABLE delta.`gs://sample-data`
SET TBLPROPERTIES (
    'retentionDurationCheck.enabled'='false',
    'delta.logRetentionDuration' = 'interval 1 days',
    'delta.deletedFileRetentionDuration' = 'interval 1 days',
    'delta.autoOptimize.optimizeWrite' = 'false',
    'delta.autoOptimize.autoCompact' = 'true',
    'delta.targetFileSize' = '1073741824'
);

Despite these settings, the log files remain in the directory beyond the specified retention period. I understand that log files should be deleted automatically after checkpoint operations, and I have ensured that checkpoints are being created.
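
For reference, the applied properties and the table's recent activity can be double-checked directly against the table. The following is a minimal sketch, using the same gs://sample-data path as above:

-- The effective table properties are returned in the `properties` column,
-- together with location, lastModified, and file counts
DESCRIBE DETAIL delta.`gs://sample-data`;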

Could there be any specific reasons or additional configurations required for these settings to take effect? Is there a known issue with certain environments or configurations that might prevent the automatic deletion of Delta log files?

I appreciate any insights or suggestions from those who have encountered and resolved similar issues.

3 REPLIES

VZLA
Databricks Employee

Thank you for sharing the details. A couple of key points to clarify and verify in this scenario:

  1. How are you confirming that the _delta_log files should have been deleted? It's important to verify that the retention period has indeed elapsed and that checkpoints have been created, as log file cleanup typically occurs after a checkpoint operation (a quick way to check this is sketched right after this list).

  2. Have you checked if the _delta_log files in question still reference data files that fall within the retention period? Delta Lake retains logs for files that are still active or could be required for transactional consistency and time travel.
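
For point 1, a minimal way to check this against the table itself (same gs://sample-data path as in your post; the checkpoint note is general guidance rather than a prescribed procedure):

-- Commit timestamps show whether the configured 1-day log retention window has actually elapsed
DESCRIBE HISTORY delta.`gs://sample-data`;

-- Checkpoint parquet files and the _last_checkpoint marker are written under
-- gs://sample-data/_delta_log/ ; listing that prefix (for example with gsutil ls)
-- confirms that checkpoints exist, since log cleanup only runs after a checkpoint is written.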

These details will help narrow down whether the issue is with the cleanup mechanism or if the files are still required for data consistency. Let us know, and we'll assist further!

minhhung0507
New Contributor

Dear @VZLA,

I apologize for the delayed response due to some unforeseen circumstances. Regarding your questions:

  1. To confirm that the _delta_log files should have been deleted, I have been monitoring the retention period and ensuring that checkpoints have been created. However, I've noticed inconsistencies across different tables. In some cases, the cleanup mechanism seems to work as expected, while in others, it does not. This discrepancy is puzzling.
  2. I have checked the _delta_log files in question, and it appears that some still reference data files within the retention period. This leads to uncertainty about whether the logs are being retained for transactional consistency or if there is an issue with the cleanup process itself.

Additionally, I've observed that certain tables create checkpoints after 10 transactions in the _delta_log, while others do not. I am unsure why this behavior differs among tables. I appreciate your assistance in narrowing down this issue, and I look forward to your guidance.

VZLA
Databricks Employee

Hi, no worries @minhhung0507 .

Please check whether the _delta_log files in question still reference data files that fall within the retention period. Delta Lake retains logs for files that are still active or could be required for transactional consistency and time travel.

If you notice inconsistencies across different tables, it might be due to differences in how checkpoints are created or how the retention period is managed for each table. Check the table properties, and the default values wherever these are not set explicitly, and ensure that the configurations for retention and checkpointing are consistent across all tables. [1]
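
As an illustrative sketch only (the second table path is hypothetical, and the 10-commit checkpoint cadence is the commonly cited default, worth confirming for your runtime), the relevant settings can be made explicit and identical on each table being compared:

-- Hypothetical second table: make checkpoint cadence and retention explicit so
-- cleanup behaviour is comparable across tables
ALTER TABLE delta.`gs://another-table`
SET TBLPROPERTIES (
    'delta.checkpointInterval' = '10',                        -- write a checkpoint every 10 commits
    'delta.logRetentionDuration' = 'interval 1 days',         -- match the original table
    'delta.deletedFileRetentionDuration' = 'interval 1 days'
);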

Delta Lake's log files are deleted automatically and asynchronously after checkpoint operations. If this is not happening, there might be an issue with the cleanup mechanism itself.

Try running the VACUUM command on the Delta table to remove data files that are no longer referenced by the table. However, note that the VACUUM command does not govern the deletion of log files; log files are managed separately and are deleted after checkpoint operations. [1]
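
A minimal sketch of that VACUUM step (the 24-hour retention window is illustrative; also note that the retention safety check is normally controlled by a Spark session setting rather than a table property, which may be why the 'retentionDurationCheck.enabled' table property above had no visible effect):

-- Allow a retention window below the 7-day default; this weakens time travel, so use with care
SET spark.databricks.delta.retentionDurationCheck.enabled = false;

-- Removes data files no longer referenced by the table; it does NOT delete _delta_log files
VACUUM delta.`gs://sample-data` RETAIN 24 HOURS;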

[1] https://docs.databricks.com/en/delta/history.html#configure-data-retention-for-time-travel-queries
