Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Myths about vacuum command

sangram11
New Contributor

I identified some myths while working with the VACUUM command on Spark 3.5.x.

1. The VACUUM command does not work with days; its RETAIN clause explicitly requires the value in hours. I tried many times, and it throws a parse syntax error (why???).

sangram11_0-1730255825227.png
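
A minimal sketch of the behavior described above (the table name is a placeholder; 168 hours is equivalent to 7 days):

-- Fails with a parse error: RETAIN only accepts HOURS
-- VACUUM my_delta_table RETAIN 7 DAYS;

-- Works: the retention interval must be expressed in hours
VACUUM my_delta_table RETAIN 168 HOURS;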

2. You cannot fully execute the VACUUM command if delta.enableChangeDataFeed is enabled, because it cannot remove files from the _change_data folder while that folder contains Parquet files.

sangram11_1-1730256066071.png

So, your table history is not deleted by the VACUUM command if CDF is enabled.
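
For context, this is how CDF gets enabled and queried in the first place; the table name and starting version are placeholders:

-- Enable the change data feed on an existing Delta table
ALTER TABLE my_delta_table SET TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true');

-- Read the recorded changes starting from a given table version;
-- these reads are served by the Parquet files under _change_data
SELECT * FROM table_changes('my_delta_table', 1);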

Let me know if you can pass me some knowledge on the VACUUM command, because I feel it is not doing its work as expected.

4 REPLIES

saurabh18cs
Contributor II

This is due to the retention period for Change Data Feed: when CDF is enabled, Databricks retains the change data for a specified period. This retention period ensures that the change data is available for downstream processing and auditing. The VACUUM command respects this retention period and does not delete files that are still within the retention window.

Verify the retention period for Change Data Feed and ensure that the VACUUM command's retention period is greater than or equal to the CDF retention period.

-- Check the table properties, including the CDF retention setting if one is set
SHOW TBLPROPERTIES my_delta_table;

-- Review recent operations on the table
DESCRIBE HISTORY my_delta_table;

-- Run VACUUM with a retention period of 168 hours (7 days)
VACUUM my_delta_table RETAIN 168 HOURS;
 
If you want to change the default retention period of the change data feed, then do this:
ALTER TABLE my_delta_table
SET TBLPROPERTIES ('delta.changeDataFeed.retentionDuration' = '30 days');
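
To confirm the change took effect, the specific property can be read back (a sketch, using the same placeholder table):

-- Verify the property is now set on the table
SHOW TBLPROPERTIES my_delta_table ('delta.changeDataFeed.retentionDuration');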

 

VZLA
Databricks Employee

> 1. The VACUUM command does not work with days; its RETAIN clause explicitly requires the value in hours. I tried many times, and it throws a parse syntax error (why???).


Can you please point us to where it is mentioned that VACUUM accepts "days" as a parameter? We may need to have that specific document updated. This is what I was able to find:

https://docs.databricks.com/en/sql/language-manual/delta-vacuum.html#parameters

 

> So, your table history is not deleted by the VACUUM command if CDF is enabled.

To summarize Saurabh's comment, the VACUUM command can still run on a table with CDF enabled, but it will respect the CDF retention period. Files that are within the CDF retention period will not be deleted by VACUUM, ensuring that change data remains available for processing. To avoid conflicts, verify that the VACUUM retention period is greater than or equal to the CDF retention period. Adjust the CDF retention period if necessary using the delta.changeDataFeed.retentionDuration property.
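
As a concrete sketch of that alignment (placeholder table name; delta.changeDataFeed.retentionDuration is the property named earlier in this thread):

-- Keep change data available for 7 days
ALTER TABLE my_delta_table
SET TBLPROPERTIES ('delta.changeDataFeed.retentionDuration' = '7 days');

-- Vacuum with a retention window of at least the same length (168 hours = 7 days)
VACUUM my_delta_table RETAIN 168 HOURS;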

Sangram
New Contributor III

@saurabh18cs & @VZLA: I found the command "vacuum table1 retain 7 days" in many YouTube videos and other educational content. This is very misleading. I found another solution to work around this.

SET spark.databricks.delta.retentionDurationCheck.enabled = false. It works when I want to delete obsolete files whose lifespan is less than the default retention duration.
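
Spelled out, that workaround looks like this (a sketch with a placeholder table name; note the risk described in the reply below):

-- Disable the safety check that rejects retention windows shorter than the default (session-scoped)
SET spark.databricks.delta.retentionDurationCheck.enabled = false;

-- VACUUM now accepts a retention shorter than the 7-day default
VACUUM my_delta_table RETAIN 0 HOURS;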

VZLA
Databricks Employee

Thanks for reporting this, Sangram. Are these YouTube videos and educational materials on the Databricks channel?

> SET spark.databricks.delta.retentionDurationCheck.enabled = false. It works when I want to delete obsolete files whose lifespan is less than the default retention duration.

That's fine as long as you know it also introduces a risk: any files essential for tracking data changes, maintaining historical versions, or supporting CDF operations could be deleted prematurely. This could result in unintentional data loss, such as losing previous data states or being unable to access certain changes, impacting versioning and downstream processes.
