Data Engineering

What is the best way to optimize historical data tables in Databricks?

Ruby8376
Valued Contributor

Hi there,

I'm looking for recommendations on how to optimize historical data tables in Databricks.

1 ACCEPTED SOLUTION

Kaniz
Community Manager

Hi @Ruby8376,

Use Table History and Time Travel:

 • Each write operation creates a new version of the table
 • Useful for auditing, rollback, and querying the table as of a specific point in time
 • Not recommended as a long-term backup mechanism
 • By default, time travel is limited to the past seven days unless the retention configuration is set to a longer period
 • Code:

deltaTable.history().show()
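
A fuller, self-contained version of the above (a minimal sketch: the table path /mnt/data/history_data is hypothetical, and spark is the SparkSession that Databricks notebooks provide):

from delta.tables import DeltaTable

# Hypothetical table location; replace with your own
table_path = "/mnt/data/history_data"

# One row per operation: version, timestamp, operation, etc.
delta_table = DeltaTable.forPath(spark, table_path)
delta_table.history().show(truncate=False)

# Time travel: read the table as of an earlier version or timestamp
df_v0 = spark.read.format("delta").option("versionAsOf", 0).load(table_path)
df_jan = (spark.read.format("delta")
          .option("timestampAsOf", "2024-01-01")
          .load(table_path))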

Partition Tables:

 • Beneficial for large tables (>1 TB)
 • Every partition should contain at least a gigabyte of data
 • Fewer, larger partitions perform better than many small ones
 • Do not partition tables that hold less than a terabyte of data (a minimal sketch follows below)

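As a rough illustration of the sizing guidance above, a minimal sketch (the DataFrame events_df, the column names, and the path are all hypothetical) of writing a large fact table partitioned by low-cardinality date columns:

# Partition a large (>1 TB) events table by year and month so each
# partition ends up with at least ~1 GB of data.
(events_df
 .write
 .format("delta")
 .partitionBy("event_year", "event_month")  # low-cardinality keys
 .mode("overwrite")
 .save("/mnt/data/events"))
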
Regularly Run VACUUM:
 • Removes data files no longer referenced by the table, reducing excess cloud storage costs
 • Default retention threshold is seven days
 • Code:

deltaTable.vacuum()
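
A self-contained sketch (same hypothetical table path as above); note that vacuum() also accepts a retention window in hours, and files removed by VACUUM are no longer available for time travel:

from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "/mnt/data/history_data")

# Remove unreferenced files older than the default 7-day threshold
delta_table.vacuum()

# Or keep a longer window, e.g. 30 days (720 hours), to preserve
# a longer time-travel horizon
delta_table.vacuum(720)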

Use OPTIMIZE Command:
 • Compacts many small data files into fewer, larger ones for better query performance
 • Recommended to run daily; adjust the frequency to balance cost and performance
 • Code: 

deltaTable.optimize().executeCompaction()
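
A minimal sketch of both forms (the Z-order column event_date is hypothetical): executeCompaction() performs plain bin-packing, while executeZOrderBy() additionally co-locates rows on columns you filter by often:

from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "/mnt/data/history_data")

# Compact small files into larger ones (bin-packing)
delta_table.optimize().executeCompaction()

# Optionally Z-order by a frequently filtered column instead
delta_table.optimize().executeZOrderBy("event_date")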

Use Clustering:
 • For tables with many updates or inserts, schedule an OPTIMIZE job every one or two hours so newly written data is clustered promptly (see the sketch below)
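
With liquid clustering (see the clustering doc below; available on recent Databricks runtimes), clustering keys are declared on the table and OPTIMIZE then incrementally clusters newly written data. A minimal sketch, with a hypothetical table and column, issued from Python:

# Declare clustering keys at table creation time (liquid clustering)
spark.sql("""
    CREATE TABLE IF NOT EXISTS history_data (
        id BIGINT,
        event_date DATE,
        payload STRING
    )
    USING DELTA
    CLUSTER BY (event_date)
""")

# Run on a 1-2 hour schedule for heavily updated tables; only
# not-yet-clustered data is rewritten.
spark.sql("OPTIMIZE history_data")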

Sources:
- [Docs: history](https://docs.databricks.com/delta/history.html)
- [Docs: partitions](https://docs.databricks.com/tables/partitions.html)
- [Docs: vacuum](https://docs.databricks.com/delta/vacuum.html)
- [Docs: optimize](https://docs.databricks.com/delta/optimize.html)
- [Docs: clustering](https://docs.databricks.com/delta/clustering.html)


2 REPLIES

Ruby8376
Valued Contributor

Thanks Kaniz!