Hi @Ruby8376 ,
Use Table History and Time Travel:
• Each operation creates a new table version
• Can be used for auditing, rollback, and querying at a specific point in time
• Not recommended for long-term backup
• Time travel is limited to the past seven days unless the retention configuration is set to a larger value
• Code:
deltaTable.history().show()  # deltaTable = DeltaTable.forPath(spark, "/path/to/table")
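To illustrate the versioning model behind history and time travel, here is a minimal in-memory sketch (not the Delta API; names like `VersionedTable` are hypothetical). Each write appends a new immutable version, a point-in-time query reads an old version, and a rollback simply re-commits an old snapshot as the newest version:

```python
import datetime

class VersionedTable:
    """Toy illustration of Delta-style table versions and time travel."""
    def __init__(self):
        self.versions = []  # list of (timestamp, snapshot) tuples

    def write(self, snapshot, ts):
        # Every operation creates a new table version
        self.versions.append((ts, dict(snapshot)))

    def as_of_version(self, n):
        # Query the table as it existed at version n (0-based)
        return self.versions[n][1]

    def restore(self, n):
        # Rollback: re-commit an old snapshot as the newest version
        ts = self.versions[-1][0]
        self.write(self.as_of_version(n), ts)

t = VersionedTable()
t0 = datetime.datetime(2024, 1, 1)
t.write({"a": 1}, t0)
t.write({"a": 2}, t0 + datetime.timedelta(days=1))
print(t.as_of_version(0))  # {'a': 1}
t.restore(0)
print(t.as_of_version(2))  # {'a': 1}  -- rolled back, but history is preserved
```

Note that a rollback adds a version rather than deleting one, which is why time travel supports auditing but is not a substitute for long-term backup.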
Partition Tables:
• Beneficial for large tables (>1 TB)
• All partitions should have at least a gigabyte of data
• Fewer, larger partitions perform better than many smaller partitions
• Do not partition tables with less than a terabyte of data
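The sizing rules above imply some simple arithmetic: only tables of at least 1 TB are candidates, and the partition count should be capped so each partition holds at least 1 GB. A small sketch with hypothetical table sizes:

```python
TB = 1024**4
GB = 1024**3

def should_partition(table_bytes, min_table_bytes=TB):
    # Partitioning is only beneficial for large tables (> 1 TB)
    return table_bytes >= min_table_bytes

def max_partitions(table_bytes, min_partition_bytes=GB):
    # Fewer, larger partitions: cap the count so each holds >= 1 GB
    return table_bytes // min_partition_bytes

small = 200 * GB   # hypothetical 200 GB table
big = 5 * TB       # hypothetical 5 TB table
print(should_partition(small))  # False -- leave it unpartitioned
print(should_partition(big))    # True
print(max_partitions(big))      # 5120 partitions at minimum, 1 GB each
```

In practice this means choosing a partition column with low enough cardinality that no partition falls below the 1 GB floor.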
Regularly Run VACUUM:
• Reduces excess cloud data storage costs
• Default retention threshold is seven days
• Code:
deltaTable.vacuum()  # deletes unreferenced files older than the 7-day default retention
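Conceptually, VACUUM deletes only data files that are no longer referenced by the table and are older than the retention threshold. A stdlib sketch of that retention check (illustrative only; `files_to_vacuum` is not a Delta API):

```python
from datetime import datetime, timedelta

def files_to_vacuum(unreferenced_files, now, retention=timedelta(days=7)):
    """Return unreferenced files older than the retention threshold
    (default 7 days), mirroring VACUUM's safety window."""
    cutoff = now - retention
    return [path for path, mtime in unreferenced_files if mtime < cutoff]

now = datetime(2024, 6, 15)
files = [
    ("part-0001.parquet", datetime(2024, 6, 1)),   # 14 days old -> deleted
    ("part-0002.parquet", datetime(2024, 6, 13)),  # 2 days old  -> kept
]
print(files_to_vacuum(files, now))  # ['part-0001.parquet']
```

The retention window is what keeps recent time-travel queries working: shortening it saves storage but shrinks how far back you can travel.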
Use OPTIMIZE Command:
• Compacts small data files for enhanced query performance
• Recommended to run daily; adjust the frequency to balance cost against query performance
• Code:
deltaTable.optimize().executeCompaction()
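The idea behind compaction can be sketched with a greedy bin-packing pass: many small files are grouped into a few files of roughly a target size, so queries open fewer file handles. This is an illustration of the concept, not Delta's actual algorithm, and the 1 GB target is an assumption:

```python
def plan_compaction(file_sizes_mb, target_mb=1024):
    """Greedy sketch of OPTIMIZE-style compaction: group small files
    into bins of roughly target_mb each."""
    bins, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb, reverse=True):
        if current and current_size + size > target_mb:
            bins.append(current)          # close this output file
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins

small_files = [100] * 20  # twenty 100 MB files
plan = plan_compaction(small_files)
print(len(plan))  # 2 compacted files instead of 20
```

Running this kind of compaction daily keeps file counts bounded; more frequent runs cost more compute but keep read performance steadier.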
Use Clustering:
• Schedule OPTIMIZE job every one or two hours for tables with many updates or inserts
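For a high-churn clustered table, the schedule above is just a fixed interval. A trivial sketch (the helper name and times are hypothetical) of generating run times at a 2-hour cadence, which you would hand to whatever scheduler runs the OPTIMIZE job:

```python
from datetime import datetime, timedelta

def optimize_schedule(start, interval_hours, runs):
    """Generate run times for a frequent OPTIMIZE job
    (every 1-2 hours, per the guidance above)."""
    return [start + timedelta(hours=interval_hours * i) for i in range(runs)]

runs = optimize_schedule(datetime(2024, 6, 15, 0, 0), 2, 3)
print([r.strftime("%H:%M") for r in runs])  # ['00:00', '02:00', '04:00']
```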
Sources:
- [Docs: history](https://docs.databricks.com/delta/history.html)
- [Docs: partitions](https://docs.databricks.com/tables/partitions.html)
- [Docs: vacuum](https://docs.databricks.com/delta/vacuum.html)
- [Docs: optimize](https://docs.databricks.com/delta/optimize.html)
- [Docs: clustering](https://docs.databricks.com/delta/clustering.html)