Administration & Architecture

Delta Lake: delete data from storage manually instead of VACUUM

sharat_n
New Contributor

Hi All

We have a unique use case where we are unable to run VACUUM to reclaim storage space for our Delta Lake tables.
Since our data is partitioned by date, we plan to delete files older than a certain date directly from storage.
Could this corrupt the table?
We do not plan to query the old data anyway, so it's okay for us if those queries fail with missing files.
What we do not want is for the new date partitions to be impacted. We also actively append new data to these tables.



Walter_C
Databricks Employee

Deleting files older than a certain date directly from storage, without using the VACUUM command, can cause problems with your Delta Lake tables. Here are the key points to consider:

  1. Corruption Risk: Deleting files directly from storage can corrupt the table. Delta Lake relies on the transaction log to track every change to the table, including which data files have been added and removed. If files are deleted outside of the transaction log, the log still references data files that no longer exist, so the table's metadata is out of sync with storage and any operation that needs those files can fail.

  2. Query Failures: As you mentioned, queries on the old data will fail if the files are missing. This is expected since the data files required for those queries will no longer be available.

  3. Impact on New Data: Even if you only append new data, deleting files manually can still affect the newer partitions. Delta Lake operations such as compaction and optimization (for example, OPTIMIZE) rely on the integrity of the transaction log and the presence of every data file it references. If one of these operations touches a file that was deleted manually, it can fail, which can in turn cause issues for new data.

  4. Best Practices: Use the VACUUM command to remove old data files safely. VACUUM deletes only files that are no longer referenced by the Delta table, which preserves the table's integrity. This also means that data in old partitions must first be removed logically (for example, with a DELETE on the date column) before VACUUM can reclaim its storage. The default retention period for VACUUM is 7 days, but this can be configured based on your requirements.

  5. Configuration Options: If you need to manage the retention of data files, you can adjust the following Delta table properties:

    • delta.logRetentionDuration: Controls how long the history for a table is kept. The default is 30 days.
    • delta.deletedFileRetentionDuration: Determines the threshold VACUUM uses to remove data files no longer referenced in the current table version. The default is 7 days.
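
As a rough illustration of points 4 and 5, the sketch below builds the SQL statements you might run: a logical DELETE of old partitions, an ALTER TABLE that sets the two retention properties, and a VACUUM. The table name `events`, the partition column `event_date`, and the helper function names are hypothetical, chosen only for this example; the property names and defaults come from the list above. Treat it as a starting point under those assumptions, not a definitive procedure.

```python
from datetime import date


def delete_old_partitions_sql(table: str, date_column: str, cutoff: date) -> str:
    """Build a DELETE that logically removes rows older than the cutoff date.

    The column name is an assumption; use your table's partition column.
    """
    return f"DELETE FROM {table} WHERE {date_column} < '{cutoff.isoformat()}'"


def set_retention_sql(table: str, log_days: int = 30, deleted_file_days: int = 7) -> str:
    """Build an ALTER TABLE that sets the two retention properties listed above."""
    return (
        f"ALTER TABLE {table} SET TBLPROPERTIES ("
        f"'delta.logRetentionDuration' = 'interval {log_days} days', "
        f"'delta.deletedFileRetentionDuration' = 'interval {deleted_file_days} days')"
    )


def vacuum_sql(table: str, retain_hours: int = 168, dry_run: bool = False) -> str:
    """Build a VACUUM statement; 168 hours matches the 7-day default."""
    stmt = f"VACUUM {table} RETAIN {retain_hours} HOURS"
    if dry_run:
        # DRY RUN lists the files that would be deleted without deleting them.
        stmt += " DRY RUN"
    return stmt


if __name__ == "__main__":
    # Hypothetical table and column names, for illustration only.
    statements = [
        delete_old_partitions_sql("events", "event_date", date(2024, 1, 1)),
        set_retention_sql("events"),
        vacuum_sql("events", dry_run=True),
        vacuum_sql("events"),
    ]
    for stmt in statements:
        # In a notebook you would pass each statement to spark.sql(stmt).
        print(stmt)
```

Running the DRY RUN variant first lets you check which files VACUUM would remove before anything is actually deleted.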
