Executing VACUUM performs garbage cleanup on the table directory. By default, a retention threshold of 7 days will be enforced.
Follow the steps below to perform VACUUM:
1.) SET spark.databricks.delta.retentionDurationCheck.enabled = false; This command overrides the retention threshold check so we can demonstrate permanent removal of data.
NOTE: Vacuuming a production table with a short retention period can lead to data corruption and/or failure of long-running queries. Use extreme caution when disabling this setting.
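As a gentler alternative to disabling the check globally, retention can be tuned per table. The sketch below assumes the `beans` table used later in this lesson; `delta.deletedFileRetentionDuration` is the Delta table property that controls how long VACUUM keeps unreferenced data files:

```sql
-- Inspect the table's current properties, including any retention override
SHOW TBLPROPERTIES beans;

-- Shorten (or lengthen) retention for this table only, rather than
-- disabling the global retention duration check
ALTER TABLE beans
SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = 'interval 7 days');
```

Per-table properties keep the safety check in place for every other table in the workspace.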
2.) Before permanently deleting data files, review them manually using the DRY RUN option:
VACUUM beans RETAIN 0 HOURS DRY RUN
All data files not in the current version of the table will be listed in the output.
3.) Run the command again without DRY RUN to permanently delete these files:
VACUUM beans RETAIN 0 HOURS
NOTE: All previous versions of the table will no longer be accessible.
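For example, a time travel query against an earlier version (this sketch assumes version 0 of the `beans` table exists in its history) will now fail, because the transaction log still lists the version but its underlying data files have been deleted:

```sql
-- Time travel to a vacuumed version fails with a missing-file error,
-- since VACUUM RETAIN 0 HOURS removed the files that version references
SELECT * FROM beans VERSION AS OF 0;
```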
Because VACUUM can be such a destructive operation for important datasets, it's always a good idea to turn the retention duration check back on. Run the cell below to reactivate this setting: SET spark.databricks.delta.retentionDurationCheck.enabled = true;
Important note: Because Delta Cache stores copies of files queried in the current session on storage volumes deployed to your currently active cluster, you may still be able to temporarily access previous table versions.
Restarting the cluster will ensure that these cached data files are permanently purged. After restarting the cluster, query your table again to confirm that you don't have access to the previous table versions.
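A minimal confirmation check after the restart might look like the following, again assuming the `beans` table from this lesson:

```sql
-- The current version of the table is unaffected and still readable
SELECT * FROM beans;

-- Earlier versions should now consistently raise an error: both the
-- vacuumed data files and any cached copies on the old cluster are gone
SELECT * FROM beans VERSION AS OF 0;
```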