cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Retention window from DLT created Delta tables

TinasheChinyati
New Contributor

Hi guys

I am working with data ingested from Azure EventHub using Delta Live Tables in databricks. Our data architecture includes the medallion approach. Our current requirement is to retain only the most recent 14 days of data in the silver layer. To meet this retention requirement, I am considering using DELETE command to remove records older than 14 days based on a specific date field, followed by a VACUUM operation to reclaim storage space. This process would be scheduled to run daily. 

Could you confirm if this approach is optimal for maintaining a 14-day retention window in the silver layer? Additionally would this method effectively manage storage costs while preserving Delta Live Tables' integrity for incremental processing? Since the files will still exist in the external storage account, I can then Life Cycle Management Policy to delete that data?

Thank you for your guidance on best practices for this setup.

1 REPLY 1

MuthuLakshmi
Databricks Employee
Databricks Employee

@TinasheChinyati In-order to retain the 14 days history of a table you have tune the below parameters. 

To query a previous table version, you must retain both the log and the data files for that version.

Data files are deleted when VACUUM runs against a table. Delta Lake manages log file removal automatically after checkpointing table versions.

Because most Delta tables have VACUUM run against them regularly, point-in-time queries should respect the retention threshold for VACUUM, which is 7 days by default.

In order to increase the data retention threshold for Delta tables, you must configure the following table properties:

  • delta.logRetentionDuration = "interval <interval>": controls how long the history for a table is kept. The default is interval 30 days.

  • delta.deletedFileRetentionDuration = "interval <interval>": determines the threshold VACUUM uses to remove data files no longer referenced in the current table version. The default is interval 7 days.

You must set both of these properties to ensure table history is retained for longer duration for tables with frequent VACUUM operations. For example, to access 30 days of historical data, set delta.deletedFileRetentionDuration = "interval 30 days" (which matches the default setting for delta.logRetentionDuration). 

In your case to retain 14 days records, keep the delta.deletedFileRetentionDuration = "interval 14 days"

https://docs.databricks.com/en/delta/history.html

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group