Hi @Brianben,
Source: https://docs.databricks.com/en/optimizations/archive-delta.html
1. How should I configure archival support in Databricks?
To enable archival support in Databricks, you need to set the archival interval through a table property. For example:
ALTER TABLE <table_name> SET TBLPROPERTIES (delta.timeUntilArchived = '1825 days');
In this case, delta.timeUntilArchived = '1825 days' tells Databricks to treat data files older than five years (1825 days, assuming 365 days per year) as archived. Please note that Databricks archival support depends on corresponding cloud-side lifecycle management policies (such as those provided by Azure, AWS, or GCP) to actually move the data to a cheaper tier.
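For illustration, here is a minimal notebook sketch of the same step in Python. It assumes the spark session predefined in a Databricks notebook, and the table name is only a placeholder:

# Placeholder table name -- replace with your own table.
table = "my_catalog.my_schema.events"

# Enable archival support by setting the table property (same statement as above).
spark.sql(f"ALTER TABLE {table} SET TBLPROPERTIES (delta.timeUntilArchived = '1825 days')")

# Confirm the property took effect.
props = spark.sql(f"SHOW TBLPROPERTIES {table}")
props.where("key = 'delta.timeUntilArchived'").show(truncate=False)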
2. Can I specify a datetime column in the table as the archive reference, or does it simply count the duration from the time the record was inserted into the table?
Archiving in Databricks currently relies on the file creation time recorded in the Delta transaction log. There isn’t a built-in mechanism to use a custom datetime column as the trigger for archiving. In other words, Databricks will consider the age of data based on when it was written to the Delta log, not a separate timestamp column.
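If you want to see which timestamp the archival clock is actually based on, one rough way is to read the add actions recorded in the table's _delta_log. This is a sketch only, run from a Databricks notebook, and the storage path is a placeholder you would replace:

# Placeholder path -- replace with your table's storage location.
log_path = "abfss://<container>@<account>.dfs.core.windows.net/path/to/table/_delta_log/*.json"

# Each commit JSON records "add" actions; their modificationTime (epoch milliseconds)
# reflects when the file was written, which is what archival age is measured against --
# not any datetime column inside the data itself.
file_ages = (
    spark.read.json(log_path)
    .where("add IS NOT NULL")
    .selectExpr(
        "add.path AS data_file",
        "timestamp_millis(add.modificationTime) AS written_at",
    )
)
file_ages.show(truncate=False)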
3. Is there any difference in terms of archiving between managed tables and external tables?
Based on current documentation, there are no differences in how archival works for managed vs. external Delta tables. The archival process is driven by the same Delta transaction log mechanism and cloud lifecycle management policies, regardless of whether your table is managed or external.
4. Is there any difference in implementation if the environment is Unity Catalog (UC) enabled?
The existing documentation does not indicate a difference in enabling or configuring archival support for UC-enabled environments. Archival support relies on the Delta transaction log and compatible compute environments, which should operate similarly in both UC and non-UC settings.
5. As per my understanding, if we enable archival support but do not configure any lifecycle management on the storage side (we are using an Azure storage account), there is no cost saving on the storage side, as Databricks simply ignores the data and the underlying storage remains unchanged. May I know how we can configure lifecycle management on the Azure storage account so it aligns with the archival support setting on the Databricks side?
If you only enable Databricks archival support without configuring lifecycle management on Azure, Databricks simply treats files older than the threshold as archived, but the data remains in the same storage tier on Azure, so there is no cost reduction.
To align Databricks archival with cost-saving measures:
- Configure your Azure Storage lifecycle policies to move or delete older files after a specified age.
- Ensure the Azure policy timeframe matches (or slightly exceeds) the value in delta.timeUntilArchived.
For detailed guidance on creating and managing lifecycle policies in Azure Storage, please refer to:
https://learn.microsoft.com/en-us/azure/storage/blobs/lifecycle-management-overview
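If you prefer to script the Azure side rather than use the portal, the sketch below shows one possible way to create such a rule with the Azure Python SDK. Please treat it as an illustration only: it assumes the azure-identity and azure-mgmt-storage packages, and the subscription ID, resource group, storage account, and blob prefix are placeholders you would replace. The 1825-day threshold mirrors the delta.timeUntilArchived example above.

from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import (
    DateAfterModification,
    ManagementPolicy,
    ManagementPolicyAction,
    ManagementPolicyBaseBlob,
    ManagementPolicyDefinition,
    ManagementPolicyFilter,
    ManagementPolicyRule,
    ManagementPolicySchema,
)

# Placeholders -- replace with your own values.
subscription_id = "<subscription-id>"
resource_group = "<resource-group>"
storage_account = "<storage-account>"
table_prefix = "<container>/path/to/table"  # blob prefix covering the Delta table's data files

client = StorageManagementClient(DefaultAzureCredential(), subscription_id)

# Move block blobs under the table's prefix to the archive tier once they are
# older than 1825 days, matching (or slightly exceeding) delta.timeUntilArchived.
rule = ManagementPolicyRule(
    name="archive-old-delta-files",
    enabled=True,
    type="Lifecycle",
    definition=ManagementPolicyDefinition(
        filters=ManagementPolicyFilter(
            blob_types=["blockBlob"],
            prefix_match=[table_prefix],
        ),
        actions=ManagementPolicyAction(
            base_blob=ManagementPolicyBaseBlob(
                tier_to_archive=DateAfterModification(
                    days_after_modification_greater_than=1825
                )
            )
        ),
    ),
)

client.management_policies.create_or_update(
    resource_group_name=resource_group,
    account_name=storage_account,
    management_policy_name="default",
    properties=ManagementPolicy(policy=ManagementPolicySchema(rules=[rule])),
)

Note that days_after_modification_greater_than counts from the blob's last modification time, which is why the Databricks-side threshold and the Azure rule should be kept aligned as described above.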
Please let us know if this answers everything or if any further questions come up.