
What is the best way to optimize historical data tables in Databricks?

Ruby8376
Valued Contributor

Hi there,

Looking for recommendations on how to optimize historical data tables in Databricks.

1 ACCEPTED SOLUTION


Kaniz
Community Manager

Hi @Ruby8376,

Use Table History and Time Travel:

 • Each operation on the table creates a new table version
 • Versions can be used for auditing, rollback, and querying the table as of a specific point in time
 • Not recommended as a long-term backup mechanism
 • Time travel reaches back only seven days unless the retention settings are configured to a larger value
 • Code:

deltaTable.history().show()
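
For context, a minimal sketch of where `deltaTable` would come from and how time travel looks in practice (the table name sales_history is hypothetical; assumes a Databricks notebook or a Spark session with the delta-spark package):

from delta.tables import DeltaTable

# Get a handle to a (hypothetical) Delta table, then inspect its history
deltaTable = DeltaTable.forName(spark, "sales_history")
deltaTable.history().show()

# Query the table as it existed at an earlier version (time travel)
spark.sql("SELECT * FROM sales_history VERSION AS OF 5").show()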

Partition Tables:

 • Beneficial for large tables (over 1 TB)
 • Each partition should hold at least a gigabyte of data
 • Fewer, larger partitions perform better than many small ones
 • Do not partition tables that hold less than a terabyte of data; a sketch follows below
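
As a rough sketch of writing a date-partitioned history table (the table and column names here are made up for illustration):

# Rewrite a large source table as a Delta table partitioned by year;
# only worthwhile when the table exceeds roughly 1 TB
(spark.table("raw_events")
    .write
    .format("delta")
    .partitionBy("event_year")
    .saveAsTable("events_history"))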

Regularly Run VACUUM:

 • Removes data files no longer referenced by the table, reducing cloud storage costs
 • The default retention threshold is seven days
 • Code:

deltaTable.vacuum()
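
If a different retention window is needed, the threshold can be passed in hours (this assumes deltaTable is a DeltaTable handle as in the history example above):

# Remove unreferenced files older than 7 days (168 hours),
# which matches the default retention threshold
deltaTable.vacuum(168)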

Use the OPTIMIZE Command:

 • Compacts small data files into larger ones for better query performance
 • Run it daily as a starting point, then adjust the frequency to balance cost against performance
 • Code:

deltaTable.optimize().executeCompaction()
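
Optionally, the same builder can Z-order the data by a column that appears frequently in filters (the column name here is hypothetical):

# Compact small files and co-locate rows by a commonly filtered column
deltaTable.optimize().executeZOrderBy("event_date")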

Use Clustering:

 • For tables with many updates or inserts, schedule an OPTIMIZE job every one or two hours so newly written data is clustered promptly; a sketch follows below
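
As a sketch, assuming liquid clustering as described in the clustering docs linked below (table and column names are hypothetical; requires a recent Databricks Runtime):

# Create a table with liquid clustering instead of static partitions,
# then compact it on a schedule with OPTIMIZE
spark.sql("""
    CREATE TABLE events_history_clustered (
        event_id BIGINT,
        event_date DATE,
        payload STRING
    )
    CLUSTER BY (event_date)
""")
spark.sql("OPTIMIZE events_history_clustered")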

Sources:
- [Docs: history](https://docs.databricks.com/delta/history.html)
- [Docs: partitions](https://docs.databricks.com/tables/partitions.html)
- [Docs: vacuum](https://docs.databricks.com/delta/vacuum.html)
- [Docs: optimize](https://docs.databricks.com/delta/optimize.html)
- [Docs: clustering](https://docs.databricks.com/delta/clustering.html)


Ruby8376
Valued Contributor

Thanks Kaniz!
