
I have a question about the VACUUM feature!

chorongs
New Contributor III

(attached screenshot chorongs_0-1688456804185.png: table history output)

The table history has piled up as shown above.

For testing, I want to erase the table's history with the VACUUM command.

After setting the option spark.databricks.delta.retentionDurationCheck.enabled = false, I ran VACUUM del_park RETAIN 0 HOURS; but the history remained unchanged.

I want to erase the history with a 0-hour retention. What should I do?

 

1 ACCEPTED SOLUTION

Vinay_M_R
Databricks Employee

Executing VACUUM performs garbage cleanup on the table directory. By default, a retention threshold of 7 days will be enforced.

 

Please follow the below steps to perform VACUUM:

 

1.) Disable the retention duration check:

SET spark.databricks.delta.retentionDurationCheck.enabled = false;

This command overrides the retention threshold check so we can demonstrate permanent removal of data.

NOTE: Vacuuming a production table with a short retention can lead to data corruption and/or failures of long-running queries, so use extreme caution when disabling this setting.
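If you want to confirm the setting took effect before proceeding (a quick sanity check; SET with no value simply reads the current setting back):

SET spark.databricks.delta.retentionDurationCheck.enabled;  -- should now return false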

 

2.) Before permanently deleting data files, review them manually using the DRY RUN option:

VACUUM beans RETAIN 0 HOURS DRY RUN

All data files not in the current version of the table will be listed in the preview.

 

3.) Run the command again without DRY RUN to permanently delete these files:

VACUUM beans RETAIN 0 HOURS

NOTE: All previous versions of the table will no longer be accessible.
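One point worth noting, since it often causes confusion like in the original question: VACUUM deletes the underlying data files, not the entries shown by DESCRIBE HISTORY. Those log entries are cleaned up separately based on the table's delta.logRetentionDuration (30 days by default), so the history listing may look unchanged even though time travel to the vacuumed versions now fails:

DESCRIBE HISTORY beans;  -- old versions may still be listed, but their data files are gone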

 

Because VACUUM can be such a destructive act for important datasets, it's always a good idea to turn the retention duration check back on. Run the following to reactivate this setting:

SET spark.databricks.delta.retentionDurationCheck.enabled = true;

 

Important note: Because Delta Cache stores copies of files queried in the current session on storage volumes deployed to your currently active cluster, you may still be able to temporarily access previous table versions.

 

Restarting the cluster will ensure that these cached data files are permanently purged. After restarting the cluster, query your table again to confirm that you don't have access to the previous table versions.
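For a concrete check (assuming the example table beans used above, with a version 0 that existed before the vacuum), a time travel query should now fail with a missing-file error:

SELECT * FROM beans VERSION AS OF 0;  -- expected to fail: the underlying data files were removed by VACUUM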


4 REPLIES

Aviral-Bhardwaj
Esteemed Contributor III

I think 0 hours is not possible; by default the retention is 7 days.
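For reference, that 7-day default comes from the delta.deletedFileRetentionDuration table property; as an alternative to disabling the session-level check, it can be lowered per table (a sketch, assuming the del_park table from the question):

ALTER TABLE del_park SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = 'interval 0 hours');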

AviralBhardwaj

VaibB
Contributor

Could you please try the below:

1) Set spark.databricks.delta.retentionDurationCheck.enabled to false.
2) VACUUM with an explicit location, e.g.
VACUUM delta.`/data/events/` RETAIN 100 HOURS  -- vacuum files not required by versions more than 100 hours old
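To preview the effect first, the DRY RUN option works with the path form as well (same assumed example path):

VACUUM delta.`/data/events/` RETAIN 100 HOURS DRY RUN  -- lists the files that would be deleted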

 

 


chorongs
New Contributor III

The test was successful. Thank you!
