11-14-2024 12:50 AM - edited 11-14-2024 12:54 AM
Hi guys
I am working with data ingested from Azure Event Hubs using Delta Live Tables in Databricks. Our data architecture follows the medallion approach. Our current requirement is to retain only the most recent 14 days of data in the silver layer. To meet this retention requirement, I am considering using a DELETE command to remove records older than 14 days based on a specific date field, followed by a VACUUM operation to reclaim storage space. This process would be scheduled to run daily.
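Concretely, the daily job I have in mind would be along these lines (table and date field names are placeholders for our silver table):
DELETE FROM <table_name> WHERE <date_field> < current_date() - INTERVAL 14 DAYS;
VACUUM <table_name>;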
Could you confirm whether this approach is optimal for maintaining a 14-day retention window in the silver layer? Additionally, would this method effectively manage storage costs while preserving Delta Live Tables' integrity for incremental processing? Since the files will still exist in the external storage account, can I then use a Lifecycle Management Policy to delete that data?
Thank you for your guidance on best practices for this setup.
11-14-2024 05:47 AM
@TinasheChinyati In order to retain 14 days of history for a table, you have to tune the parameters below.
To query a previous table version, you must retain both the log and the data files for that version. Data files are deleted when VACUUM runs against a table. Delta Lake manages log file removal automatically after checkpointing table versions. Because most Delta tables have VACUUM run against them regularly, point-in-time queries should respect the retention threshold for VACUUM, which is 7 days by default.
In order to increase the data retention threshold for Delta tables, you must configure the following table properties:
delta.logRetentionDuration = "interval <interval>": controls how long the history for a table is kept. The default is interval 30 days.
delta.deletedFileRetentionDuration = "interval <interval>": determines the threshold VACUUM uses to remove data files no longer referenced in the current table version. The default is interval 7 days.
You must set both of these properties to ensure table history is retained for a longer duration for tables with frequent VACUUM operations. For example, to access 30 days of historical data, set delta.deletedFileRetentionDuration = "interval 30 days" (which matches the default setting for delta.logRetentionDuration).
In your case, to retain 14 days of records, set delta.deletedFileRetentionDuration = "interval 14 days".
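For example, both properties can be set on the table like this (table name is a placeholder):
ALTER TABLE <table_name> SET TBLPROPERTIES (
  'delta.logRetentionDuration' = 'interval 14 days',
  'delta.deletedFileRetentionDuration' = 'interval 14 days'
);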
https://docs.databricks.com/en/delta/history.html
a month ago
Thank you for sharing the configurations. Here is a bit more clarity on our current workflow.
DELETE and VACUUM Workflow
Our workflow involves the following:
1. DELETE Operation:
We delete records matching a specific predicate to mark their corresponding files for deletion. For example:
DELETE FROM <table_name> WHERE LogSubmissionDate < current_date() - INTERVAL 14 DAYS;
https://docs.delta.io/latest/delta-update.html#delete-from-a-table
2. VACUUM Operation:
After the DELETE operation, we run VACUUM to permanently remove the files marked as deleted, including any associated data in the _change_data folder:
VACUUM <table_name> RETAIN 0 HOURS;
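(For completeness: VACUUM with a retention below the default 7-day threshold requires first disabling the safety check on the session, for example:)
SET spark.databricks.delta.retentionDurationCheck.enabled = false;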
Understanding
• If no DELETE or UPDATE operation has occurred, running VACUUM or setting any of the configurations (delta.deletedFileRetentionDuration or delta.logRetentionDuration) will have no effect.
• This is because VACUUM only acts on files that are no longer referenced in the transaction log, which requires a prior operation (like DELETE or UPDATE) to mark files for deletion.
Question
Given this understanding, if we solely rely on the configurations delta.deletedFileRetentionDuration or delta.logRetentionDuration without explicitly performing a DELETE operation, would this achieve the same result as our current DELETE-VACUUM workflow in terms of cleaning up old data?
Thank you for your insights!
a month ago
Based on your inputs, my understanding is that DELETE is for deleting old records (historical data) from the Delta table.
If that's the case, DELETE followed by VACUUM, with logRetentionDuration and deletedFileRetentionDuration configured, will remove the underlying data files. The retention properties on their own do not delete anything from the current table version, so the explicit DELETE is still needed.
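Putting it together, a minimal sketch of the full retention setup (table name is a placeholder; LogSubmissionDate is the date field from your example, and the intervals can be tuned to your SLA):
-- One-time setup: align the table retention properties with the 14-day window
ALTER TABLE <table_name> SET TBLPROPERTIES (
  'delta.logRetentionDuration' = 'interval 14 days',
  'delta.deletedFileRetentionDuration' = 'interval 14 days'
);
-- Daily job: DELETE marks the old files as removed, VACUUM physically deletes them
DELETE FROM <table_name> WHERE LogSubmissionDate < current_date() - INTERVAL 14 DAYS;
VACUUM <table_name>;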