Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Designing Reliable Data Versioning Strategies in Databricks

Raj_DB
Contributor

Hi everyone,

I'm working on a use case where I need to retain 30 days of historical data in a Delta table and use it to build trend reports.

I'm looking for the best approach to reliably maintain this historical data while also making it suitable for reporting. I explored the Delta Lake time travel feature, but it currently only allows access to about 7 days of history, which doesn't meet my requirement.

What would be the recommended way to handle this scenario?

Thank you!

5 REPLIES

lingareddy_Alva
Esteemed Contributor

Hi @Raj_DB ,

For reliable 30-day trend reporting, don't rely on time travel; use an explicit snapshot table instead.
The pattern:
1. Daily job appends a snapshot of your source table to a history table with a snapshot_date partition column
2. Purge rows older than 30 days with a scheduled DELETE WHERE snapshot_date < current_date() - INTERVAL 30 DAYS
3. Run VACUUM after to reclaim storage.
This gives you full control over retention and keeps trend queries fast via partition pruning.
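A minimal sketch of the three steps above in Databricks SQL (table and column names are placeholders for your own):

```sql
-- One-time: history table with the source schema plus a snapshot_date partition
CREATE TABLE IF NOT EXISTS catalog.schema.source_history
PARTITIONED BY (snapshot_date)
AS SELECT *, current_date() AS snapshot_date
FROM catalog.schema.source_table
WHERE 1 = 0;  -- copies the schema only, no rows

-- 1. Daily job: append today's full snapshot
INSERT INTO catalog.schema.source_history
SELECT *, current_date() AS snapshot_date
FROM catalog.schema.source_table;

-- 2. Purge rows older than 30 days (partition pruning keeps this cheap)
DELETE FROM catalog.schema.source_history
WHERE snapshot_date < current_date() - INTERVAL 30 DAYS;

-- 3. Reclaim the storage behind the deleted files
VACUUM catalog.schema.source_history;
```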

 

LR

Thanks @lingareddy_Alva for your suggestion.

DivyaandData
Databricks Employee

You're hitting the new 7-day time-travel enforcement: in recent runtimes, time travel is blocked beyond delta.deletedFileRetentionDuration (default 7 days), regardless of VACUUM arguments or logRetentionDuration.

To support 30 days of historical data for reporting, you have two options:

1. Use Delta retention settings for 30-day time travel

If 30 days is enough history and you're okay paying for the extra storage, just configure the table's Delta properties:

```sql
ALTER TABLE catalog.schema.table_name
SET TBLPROPERTIES (
  'delta.deletedFileRetentionDuration' = 'interval 30 days',
  'delta.logRetentionDuration'         = 'interval 30 days'
);
```
  • delta.deletedFileRetentionDuration controls how long old data files are kept and therefore the maximum time-travel window (default 7 days).
     
  • delta.logRetentionDuration controls how long the transaction log/history is kept (default 30 days) and must be ≥ deletedFileRetentionDuration in new runtimes.
     
  • Docs explicitly note: to access 30 days of history, set delta.deletedFileRetentionDuration = "interval 30 days" to match the default logRetentionDuration.

This is the simplest way to have a 30-day rolling history directly available via time travel for your trend reports.
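Once those properties are set, trend reports can query the table as of any point in the 30-day window, for example (reusing the table name from the ALTER statement above; the version number is illustrative):

```sql
-- State of the table three weeks ago
SELECT *
FROM catalog.schema.table_name
TIMESTAMP AS OF current_date() - INTERVAL 21 DAYS;

-- Or pin to an exact version listed by DESCRIBE HISTORY
SELECT *
FROM catalog.schema.table_name
VERSION AS OF 123;
```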

2. (Recommended for robust reporting) Build a dedicated history / trend table

Delta docs and guidance caution against relying on table history as a long-term backup or analytics store and recommend using history/time travel mainly for short-term ops and recovery.

For reporting use cases (especially if you may want >30 days later), a common pattern is:

  • Create a separate "history" or "snapshot" Delta table, partitioned by a logical date (e.g., snapshot_date or business date).
  • On a daily job (or whatever cadence you need), either:
    • Take a point-in-time snapshot of the source table and append it with that day's snapshot_date, or
    • Use Change Data Feed (CDF) to upsert incremental changes into a slowly-changing dimension / fact table with effective dates.
  • Point your BI / trend reports at this reporting table, not at time-travel queries.

This pattern keeps reporting predictable and decoupled from operational retention and VACUUM settings, while you still keep ~30 days of native time travel for debugging and recovery.
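As a sketch of the CDF branch of that pattern (table names are hypothetical; assumes CDF is enabled on the source before the window you read from):

```sql
-- One-time: turn on Change Data Feed for the source table
ALTER TABLE catalog.schema.source_table
SET TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true');

-- Daily job: append the last day's changes to an incremental history table
INSERT INTO catalog.schema.source_changes
SELECT *, current_date() AS snapshot_date
FROM table_changes('catalog.schema.source_table',
                   current_date() - INTERVAL 1 DAY,  -- start timestamp
                   current_timestamp())              -- end timestamp
WHERE _change_type IN ('insert', 'update_postimage', 'delete');
```

From there, a MERGE into an SCD-style table with effective dates is straightforward.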

Raj_DB
Contributor

Hi @DivyaandData, thank you so much for the explanation. I have one question: why is time travel not a good choice, and does it cost more than using a simple append-only history table?

DivyaandData
Databricks Employee

Hey @Raj_DB, the TL;DR: time travel is great for short-term ops and debugging, but brittle as your primary reporting history, and its cost profile is harder to control and reason about than a purpose-built history table's.

  • Docs 1,2 explicitly say Delta table history/time travel is for auditing, rollback, and point-in-time queries, and is not recommended as a long-term backup/archival solution.
  • In new runtimes, time travel is blocked once you go beyond delta.deletedFileRetentionDuration (default 7 days), regardless of VACUUM arguments. Source
  • Time travel depends on both log and data retention; VACUUM and log cleanup permanently remove older versions. Docs - 1,2
In terms of cost:
  • There's no separate "time travel fee": you pay for storage and compute in both patterns. The cost difference comes from how much data you retain and how it's organized.
  • To get a long time-travel window, you must keep all old data files and logs for the source table for that period (deletedFileRetentionDuration and logRetentionDuration), which increases storage and can slow queries because they scan more versions/files. Source

With a dedicated history table, you can:

  • Store only what reporting needs (e.g., daily snapshots or SCD facts instead of every tiny intermediate write).
  • Partition and compact for reporting patterns.
  • Apply your own retention (e.g., 13 months) without affecting operational tables or being surprised by VACUUM behavior.
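For example, a 13-month retention on the history table is just a scheduled statement against its partition column, completely independent of the source table's Delta retention properties (table name hypothetical, matching the snapshot pattern above):

```sql
DELETE FROM catalog.schema.source_history
WHERE snapshot_date < add_months(current_date(), -13);
```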