I have a question about the VACUUM feature!

chorongs
New Contributor III

[Screenshot: chorongs_0-1688456804185.png]

As shown in the screenshot above, the table history has piled up.

For testing, I want to erase the history of the table with the VACUUM command.

After setting the option spark.databricks.delta.retentionDurationCheck.enabled = false, I ran the command VACUUM del_park RETAIN 0 HOURS; but the history remained unchanged.

I want to erase the history with a 0-hour retention. What should I do?
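For reference, the sequence described above, with corrected syntax, would look like this (using the table name del_park from the question):

```sql
-- Disable the retention-duration safety check for this session
SET spark.databricks.delta.retentionDurationCheck.enabled = false;

-- Remove all data files not required by the current table version
VACUUM del_park RETAIN 0 HOURS;
```

Note that VACUUM removes unreferenced data files; the entries listed by DESCRIBE HISTORY may still appear, but older versions will no longer be readable once their files are deleted.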

 

1 ACCEPTED SOLUTION


Vinay_M_R
Valued Contributor II

Executing VACUUM performs garbage cleanup on the table directory. By default, a retention threshold of 7 days will be enforced.

 

Please follow the steps below to perform VACUUM:

 

1.) SET spark.databricks.delta.retentionDurationCheck.enabled = false; This command overrides the retention threshold check and allows permanent removal of data.

NOTE: Vacuuming a production table with a short retention can lead to data corruption and/or failure of long-running queries. Use extreme caution when disabling this setting.

 

2.) Before permanently deleting data files, review them manually using the DRY RUN option:

VACUUM beans RETAIN 0 HOURS DRY RUN

All data files not in the current version of the table will be shown in the preview.

 

3.) Run the command again without DRY RUN to permanently delete these files:

VACUUM beans RETAIN 0 HOURS

NOTE: All previous versions of the table will no longer be accessible.

 

Because VACUUM can be such a destructive act for important datasets, it's always a good idea to turn the retention duration check back on. Run the following to reactivate this setting: SET spark.databricks.delta.retentionDurationCheck.enabled = true;

 

Important note: Because Delta Cache stores copies of files queried in the current session on storage volumes deployed to your currently active cluster, you may still be able to temporarily access previous table versions.

 

Restarting the cluster will ensure that these cached data files are permanently purged. After restarting the cluster, query your table again to confirm that you don't have access to the previous table versions.
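Putting the steps above together, the full session looks like this (using the example table name beans from the steps):

```sql
-- 1. Disable the retention safety check (use with extreme caution)
SET spark.databricks.delta.retentionDurationCheck.enabled = false;

-- 2. Preview which files would be deleted, without removing anything
VACUUM beans RETAIN 0 HOURS DRY RUN;

-- 3. Permanently delete the unreferenced data files
VACUUM beans RETAIN 0 HOURS;

-- 4. Re-enable the safety check
SET spark.databricks.delta.retentionDurationCheck.enabled = true;
```

After this, restart the cluster to purge any cached copies of the deleted files.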


4 REPLIES

Aviral-Bhardwaj
Esteemed Contributor III

I think 0 hours is not possible; by default, the retention threshold is 7 days.
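As a side note, the 7-day default can also be lowered for a single table via the delta.deletedFileRetentionDuration table property, rather than disabling the session-level check (shown here with a hypothetical table name my_table):

```sql
-- Lower the minimum file retention for one table only
ALTER TABLE my_table
SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = 'interval 1 hours');
```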

VaibB
Contributor

Could you please try the following:

1) Set spark.databricks.delta.retentionDurationCheck.enabled to false.
2) Vacuum with the table location, e.g.
VACUUM delta.`/data/events/` RETAIN 100 HOURS  -- vacuum files not required by versions more than 100 hours old
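The path-based form also supports a DRY RUN preview, so you can inspect what would be removed first:

```sql
-- Preview the files that would be deleted, without removing anything
VACUUM delta.`/data/events/` RETAIN 100 HOURS DRY RUN;
```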

 

 


chorongs
New Contributor III

The test was successful. Thank you!
