Data Engineering

What is the best way to optimize historical data tables in Databricks?

Ruby8376
Valued Contributor

Hi there,

I'm looking for recommendations on how to optimize historical data tables in Databricks.

1 ACCEPTED SOLUTION


Kaniz_Fatma
Community Manager

Hi @Ruby8376,

Use Table History and Time Travel:

 • Each operation on a Delta table creates a new table version
 • Versions can be used for auditing, rollback, and querying the table as of a specific point in time
 • Not recommended as a long-term backup mechanism
 • Time travel reaches back only seven days by default; set the retention configurations to a larger value if you need more
 • Code:

deltaTable.history().show()
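
A fuller sketch for reference (assuming a Delta table named my_table; the name is a placeholder, and spark is the SparkSession available in Databricks notebooks):

from delta.tables import DeltaTable

# Load an existing Delta table by name
deltaTable = DeltaTable.forName(spark, "my_table")

# Show the version history: operation, timestamp, user, and more
deltaTable.history().show()

# Time travel: query the table as it was at an earlier version
spark.sql("SELECT * FROM my_table VERSION AS OF 1").show()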

Partition Tables:

 • Beneficial for large tables (>1 TB)
 • Each partition should hold at least a gigabyte of data
 • Fewer, larger partitions perform better than many smaller partitions
 • Do not partition tables holding less than a terabyte of data (a minimal write example follows below)
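
A minimal sketch of writing a partitioned Delta table (df, the events table, and the year column are all hypothetical):

# Partition by a low-cardinality column such as year;
# each resulting partition should hold at least ~1 GB of data
(df.write
   .format("delta")
   .partitionBy("year")
   .saveAsTable("events"))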

Regularly Run VACUUM:
 • Removes data files that are no longer referenced by the table, reducing cloud storage costs
 • Default retention threshold is seven days; files younger than the threshold are kept
 • Code: 

deltaTable.vacuum()
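
If the default seven-day window does not fit your needs, vacuum() accepts a retention threshold in hours; a sketch (168 hours = 7 days; my_table is a placeholder):

# Remove files no longer referenced by the table and older than 168 hours
deltaTable.vacuum(168)

# SQL variant with DRY RUN to preview which files would be deleted
spark.sql("VACUUM my_table RETAIN 168 HOURS DRY RUN").show()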

Use OPTIMIZE Command:
 • Compacts small data files for enhanced query performance
 • Recommended to run daily; adjust the frequency to balance cost and performance
 • Code: 

deltaTable.optimize().executeCompaction()
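
Where queries filter on a particular column, the same builder can also Z-order the data during compaction; a sketch (event_date is a placeholder column):

# Compact small files and co-locate rows by a frequently filtered column
deltaTable.optimize().executeZOrderBy("event_date")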

Use Clustering:
 • For tables with clustering enabled, schedule an OPTIMIZE job every one or two hours if the table sees many updates or inserts (a sketch follows below)
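
A minimal sketch of enabling liquid clustering via SQL (the table and column names are hypothetical, and this requires a recent Databricks Runtime):

# Create a table clustered by a frequently filtered column, then run
# OPTIMIZE on a schedule so new data is incrementally reclustered
spark.sql("""
    CREATE TABLE events_clustered (id BIGINT, event_date DATE, payload STRING)
    CLUSTER BY (event_date)
""")
spark.sql("OPTIMIZE events_clustered")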

Sources:
- [Docs: history](https://docs.databricks.com/delta/history.html)
- [Docs: partitions](https://docs.databricks.com/tables/partitions.html)
- [Docs: vacuum](https://docs.databricks.com/delta/vacuum.html)
- [Docs: optimize](https://docs.databricks.com/delta/optimize.html)
- [Docs: clustering](https://docs.databricks.com/delta/clustering.html)


2 REPLIES


Ruby8376
Valued Contributor

Thanks Kaniz!
