Myths about vacuum command
10-29-2024 07:49 PM
I identified some myths while working with the VACUUM command on Spark 3.5.x.
1. The VACUUM command does not work with days: its RETAIN clause explicitly requires values in hours. I tried many times, and it throws a parse/syntax error (why???).
2. You cannot fully clean up a table with VACUUM if delta.enableChangeDataFeed is enabled, because it cannot remove files from the _change_data folder while that folder contains Parquet files. So your table history is not deleted by the VACUUM command if CDF is enabled.
Let me know if you want to pass me some knowledge on the VACUUM command, because I feel it is not doing its work as expected.
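On myth 1, the parse error matches the documented grammar: VACUUM's RETAIN clause accepts HOURS only, so a days-based retention has to be converted. A minimal sketch (the table name `my_table` is a placeholder):

```sql
-- RETAIN takes hours only; 7 days = 168 hours
VACUUM my_table RETAIN 168 HOURS;

-- This fails with a parse error, since DAYS is not part of the grammar:
-- VACUUM my_table RETAIN 7 DAYS;
```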
- Labels: Delta Lake, Spark
10-31-2024 04:09 AM
This is due to the retention period for Change Data Feed: when CDF is enabled, Databricks retains the change data for a specified period so that it remains available for downstream processing and auditing. The VACUUM command respects this retention period and does not delete files that are still within the retention window.
Verify the retention period for Change Data Feed, and ensure that the VACUUM command's retention period is greater than or equal to the CDF retention period.
ALTER TABLE my_delta_table
SET TBLPROPERTIES ('delta.changeDataFeed.retentionDuration' = '30 days');
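A VACUUM run consistent with the 30-day retention set above could then look like this (a sketch; `my_delta_table` is the same placeholder name as in the ALTER TABLE statement, with 30 days expressed as 720 hours):

```sql
-- Retention window >= CDF retention (30 days = 720 hours)
VACUUM my_delta_table RETAIN 720 HOURS;
```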
10-31-2024 05:28 AM
> 1. vacuum command is not working with days. Instead it's retain clause is asking explicitly to supply values in hours. I tried many times, and it is throwing parse syntax error (why ???).

Can you please point us to where it is mentioned that VACUUM accepts "days" as a parameter? We may need to have that specific document updated. This is what I was able to find:
https://docs.databricks.com/en/sql/language-manual/delta-vacuum.html#parameters
> So, your table history is not deleted by the vacuum command if CDF is enabled.
To summarize Saurabh's comment, the VACUUM command can still run on a table with CDF enabled, but it will respect the CDF retention period. Files that are within the CDF retention period will not be deleted by VACUUM, ensuring that change data remains available for processing. To avoid conflicts, verify that the VACUUM retention period is greater than or equal to the CDF retention period. Adjust the CDF retention period if necessary using the delta.changeDataFeed.retentionDuration property.
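One way to check this interaction before deleting anything is VACUUM's DRY RUN option, which lists the files that would be removed without removing them (a sketch; `my_delta_table` is a placeholder name):

```sql
-- Preview the files VACUUM would delete, without deleting anything
VACUUM my_delta_table DRY RUN;
```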
10-31-2024 11:41 PM
@saurabh18cs & @VZLA: I found the command "vacuum table1 retain 7 days" in many YouTube and educational contents. This is very misleading. I found another solution to avoid this:
SET spark.databricks.delta.retentionDurationCheck.enabled = false. It works if I want to delete obsolete files whose lifespan is less than the default retention duration.
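The workaround above can be sketched end to end (assuming a hypothetical table `table1`; the configuration key is the standard Spark conf for this safety check):

```sql
-- Disable the safety check that blocks retention windows shorter than
-- the 7-day default. Use with caution: files still needed for time
-- travel or CDF may be deleted prematurely.
SET spark.databricks.delta.retentionDurationCheck.enabled = false;

-- A retention below 168 hours now parses and runs
VACUUM table1 RETAIN 24 HOURS;
```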
11-01-2024 01:25 AM
Thanks for reporting this, Sangram. Are these YouTube and educational contents on the Databricks channel?
> SET spark.databricks.delta.retentionDurationCheck.enabled = false. It works if I want to delete obsolete files whose lifespan is less than the default retention duration.
That's fine as long as you know this also introduces a risk: any files essential for tracking data changes, maintaining historical versions, or supporting CDF operations could be deleted prematurely. This could result in unintentional data loss, such as loss of previous data states or inability to access certain changes, impacting versioning and downstream processes.

