cancel
Showing results forย 
Search instead forย 
Did you mean:ย 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results forย 
Search instead forย 
Did you mean:ย 

How can I efficiently archive old data in Delta tables without slowing queries?

Suheb
New Contributor III

How can I remove or move older rows from my main Delta table so that queries on recent data are faster, while still keeping access to the historical data if needed?

1 REPLY 1

Coffee77
Contributor III

Hi Suheb, when using delta tables with databricks, whenever you use proper liquid clustering indexes or partitions, you should get a good performance in comparison to relational engines to deal with big data volumes.

However, you can also separate tables to store historical values in a delta table and current values in another one. There would be a lot of strategies depending on your use case. For instance:

1) Create a historical table with same schema as current table. Create a job to move data from current table to historical table on a scheduled basis. Once rows are moved, delete from current table.

2) Create a historical table based on CDF (Change Data Feed) feature to perform row level tracking. So, create a job to do the same BUT taking into account that you'll have in historical table, not only data as they were in current table but data as it was before and after updates, etc.. Very useful for row level tracking or audit purposes.

3) Irrespective of strategy, create views with UNION operator to consolidate current and historical data.

4) Optionally, run periodic jobs to optimize and vacuum large or very large delta tables. You can even integrate this job as a shared task in the above jobs to move data.

I hope this helps.


Lifelong Learner Cloud & Data Solution Architect | https://www.youtube.com/@CafeConData