Data Engineering

What is the best way to optimize historical data tables in Databricks?

Ruby8376
Valued Contributor

Hi there,

I'm looking for recommendations on how to optimize historical data tables in Databricks.

1 ACCEPTED SOLUTION

Kaniz
Community Manager

Hi @Ruby8376,

Use Table History and Time Travel:

 • Each write operation creates a new version of the table
 • Useful for auditing, rollback, and querying the table as of a specific point in time
 • Not recommended as a long-term backup mechanism
 • By default, time travel is limited to the past seven days unless the retention configuration is set to a longer period
 • Code:

deltaTable.history().show()
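
A fuller, self-contained version of the above (a minimal sketch: the table path /mnt/data/history_data is hypothetical, and spark is the SparkSession that Databricks notebooks provide):

from delta.tables import DeltaTable

# Hypothetical table location; replace with your own
table_path = "/mnt/data/history_data"

# One row per operation: version, timestamp, operation, etc.
delta_table = DeltaTable.forPath(spark, table_path)
delta_table.history().show(truncate=False)

# Time travel: read the table as of an earlier version or timestamp
df_v0 = spark.read.format("delta").option("versionAsOf", 0).load(table_path)
df_jan = (spark.read.format("delta")
          .option("timestampAsOf", "2024-01-01")
          .load(table_path))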

Partition Tables:

 • Beneficial for large tables (>1 TB)
 • Every partition should contain at least a gigabyte of data
 • Fewer, larger partitions perform better than many small ones
 • Do not partition tables that hold less than a terabyte of data (a minimal sketch follows below)

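As a rough illustration of the sizing guidance above, a minimal sketch (the DataFrame events_df, the column names, and the path are all hypothetical) of writing a large fact table partitioned by low-cardinality date columns:

# Partition a large (>1 TB) events table by year and month so each
# partition ends up with at least ~1 GB of data.
(events_df
 .write
 .format("delta")
 .partitionBy("event_year", "event_month")  # low-cardinality keys
 .mode("overwrite")
 .save("/mnt/data/events"))
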
Regularly Run VACUUM:
 • Removes data files no longer referenced by the table, reducing excess cloud storage costs
 • Default retention threshold is seven days
 • Code:

deltaTable.vacuum()
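
A self-contained sketch (same hypothetical table path as above); note that vacuum() also accepts a retention window in hours, and files removed by VACUUM are no longer available for time travel:

from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "/mnt/data/history_data")

# Remove unreferenced files older than the default 7-day threshold
delta_table.vacuum()

# Or keep a longer window, e.g. 30 days (720 hours), to preserve
# a longer time-travel horizon
delta_table.vacuum(720)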

Use OPTIMIZE Command:
 • Compacts many small data files into fewer, larger ones for better query performance
 • Recommended to run daily; adjust the frequency to balance cost and performance
 • Code: 

deltaTable.optimize().executeCompaction()
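
A minimal sketch of both forms (the Z-order column event_date is hypothetical): executeCompaction() performs plain bin-packing, while executeZOrderBy() additionally co-locates rows on columns you filter by often:

from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "/mnt/data/history_data")

# Compact small files into larger ones (bin-packing)
delta_table.optimize().executeCompaction()

# Optionally Z-order by a frequently filtered column instead
delta_table.optimize().executeZOrderBy("event_date")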

Use Clustering:
 • For tables with many updates or inserts, schedule an OPTIMIZE job every one or two hours so newly written data is clustered promptly (see the sketch below)
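
With liquid clustering (see the clustering doc below; available on recent Databricks runtimes), clustering keys are declared on the table and OPTIMIZE then incrementally clusters newly written data. A minimal sketch, with a hypothetical table and column, issued from Python:

# Declare clustering keys at table creation time (liquid clustering)
spark.sql("""
    CREATE TABLE IF NOT EXISTS history_data (
        id BIGINT,
        event_date DATE,
        payload STRING
    )
    USING DELTA
    CLUSTER BY (event_date)
""")

# Run on a 1-2 hour schedule for heavily updated tables; only
# not-yet-clustered data is rewritten.
spark.sql("OPTIMIZE history_data")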

Sources:
- [Docs: history](https://docs.databricks.com/delta/history.html)
- [Docs: partitions](https://docs.databricks.com/tables/partitions.html)
- [Docs: vacuum](https://docs.databricks.com/delta/vacuum.html)
- [Docs: optimize](https://docs.databricks.com/delta/optimize.html)
- [Docs: clustering](https://docs.databricks.com/delta/clustering.html)


2 REPLIES

Ruby8376
Valued Contributor

Thanks Kaniz!