Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Data archiving with Delta tables in a UC-enabled environment

Brianben
New Contributor

Hi all,

I am new to Databricks and Delta tables. Recently I have been researching data archiving, as we have a data retention policy within the company.

I have studied the documentation (Archival support in Databricks | Databricks on AWS) and I am exploring the feasibility of applying this feature to my case.

Let's say I have a requirement to archive data after 5 years, and all of the tables are Delta tables. I would like to ask the following questions:

1. How should I configure this in Databricks?

2. Can I specify a datetime column in the table as the archive reference duration? Or does it simply count the duration from the time the record is inserted into the table?

3. Is there any difference in terms of archiving between managed tables and external tables?

4. Is there any difference in implementation if the environment is UC-enabled?

5. As per my understanding, if we enable archival support but do not configure any lifecycle management on the storage side (we are using an Azure storage account), there is no cost saving on the storage side, as Databricks simply ignores the data and the underlying storage remains unchanged. May I know how we can configure lifecycle management on the Azure storage account so it aligns with the archival support setting on the Databricks side?

I would highly appreciate it if someone could help me with the above questions. Thanks in advance.

 

1 REPLY

VZLA
Databricks Employee

Hi @Brianben ,

Source: https://docs.databricks.com/en/optimizations/archive-delta.html

1. How should I configure this in Databricks?

To enable archival support in Databricks, you need to set the archival interval through a table property. For example:

ALTER TABLE <table_name> SET TBLPROPERTIES (delta.timeUntilArchived = '1825 days');

In this case, delta.timeUntilArchived = '1825 days' indicates the data should be archived after five years (assuming 365 days per year). Please note that Databricks archival support depends on corresponding cloud-side lifecycle management policies (such as those provided by Azure, AWS, or GCP).
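
To confirm the property took effect, you can query it directly (a quick check; replace <table_name> with your table):

SHOW TBLPROPERTIES <table_name> ('delta.timeUntilArchived');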

2. Can I specify a datetime column in the table as the archive reference duration? Or does it simply count the duration from the time the record is inserted into the table?

Archiving in Databricks currently relies on the file creation time recorded in the Delta transaction log. There isn’t a built-in mechanism to use a custom datetime column as the trigger for archiving. In other words, Databricks will consider the age of data based on when it was written to the Delta log, not a separate timestamp column.
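
If archiving must follow a business timestamp rather than write time, one workaround (a common data-layout pattern, not part of archival support itself; the table and column names below are placeholders) is to partition append-only tables by a date column, so that file paths and file ages roughly track the record timestamps:

-- Sketch, assuming append-only data: files land under event_date partitions,
-- so file creation time roughly matches the business date (placeholder names).
CREATE TABLE events (
  id BIGINT,
  payload STRING,
  event_date DATE
)
PARTITIONED BY (event_date);

Cloud lifecycle rules can then also be scoped to partition prefixes if needed.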

3. Is there any difference in terms of archiving between managed tables and external tables?

Based on current documentation, there are no differences in how archival works for managed vs. external Delta tables. The archival process is driven by the same Delta transaction log mechanism and cloud lifecycle management policies, regardless of whether your table is managed or external.

4. Is there any difference in implementation if the environment is UC-enabled?

The existing documentation does not indicate a difference in enabling or configuring archival support for UC-enabled environments. Archival support relies on the Delta transaction log and compatible compute environments, which should operate similarly in both UC and non-UC settings.

5. As per my understanding, if we enable archival support but do not configure any lifecycle management on the storage side (we are using an Azure storage account), there is no cost saving on the storage side, as Databricks simply ignores the data and the underlying storage remains unchanged. May I know how we can configure lifecycle management on the Azure storage account so it aligns with the archival support setting on the Databricks side?

If you only enable Databricks archival support without configuring lifecycle management on Azure, the data is merely marked as archived within Databricks but remains in the same storage tier on Azure, resulting in no cost reduction.

To align Databricks archival with cost-saving measures:

  1. Configure your Azure Storage lifecycle policies to move or delete older files after a specified age.
  2. Ensure the Azure policy timeframe matches (or slightly exceeds) the value in delta.timeUntilArchived.
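
For illustration only, an Azure Storage lifecycle rule that moves blobs under a Delta table's path to the archive tier after 1825 days could look like the following; the rule name, container, and path prefix are placeholders to adapt:

{
  "rules": [
    {
      "name": "archive-delta-after-5-years",
      "enabled": true,
      "type": "Lifecycle",
      "definition": {
        "filters": {
          "blobTypes": [ "blockBlob" ],
          "prefixMatch": [ "mycontainer/path/to/delta-table" ]
        },
        "actions": {
          "baseBlob": {
            "tierToArchive": { "daysAfterModificationGreaterThan": 1825 }
          }
        }
      }
    }
  ]
}

Such a policy can be applied with the Azure CLI, for example: az storage account management-policy create --account-name <account> --resource-group <group> --policy @policy.json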

For detailed guidance on creating and managing lifecycle policies in Azure Storage, please refer to:

https://learn.microsoft.com/en-us/azure/storage/blobs/lifecycle-management-overview

Please let us know if this raises further questions or if it answers everything.
