Hi @Ruby8376,
- Use Table History and Time Travel:
• Each operation creates a new table version
• Can be used for auditing, rollback, and querying the table at a specific point in time
• Not recommended for long-term backup
• Time travel is only reliable within the past seven days unless the retention settings are configured to a larger value
• Code:
deltaTable.history().show()
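
A minimal sketch of inspecting history and time-traveling to an older snapshot, assuming a `SparkSession` named `spark`; the table name and path below are placeholders, not from the original post:

```python
from delta.tables import DeltaTable

# Placeholder table name; replace with your own.
delta_table = DeltaTable.forName(spark, "my_table")

# Show the version history (operation, timestamp, user, etc.).
delta_table.history().show(truncate=False)

# Time travel: read an earlier snapshot by version or by timestamp.
df_v0 = spark.read.format("delta").option("versionAsOf", 0).load("/path/to/delta-table")
df_old = spark.read.format("delta").option("timestampAsOf", "2024-01-01").load("/path/to/delta-table")
```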
- Partition Tables:
• Beneficial for large tables (>1 TB)
• Every partition should contain at least 1 GB of data
• Fewer, larger partitions perform better than many small partitions
• Do not partition tables with less than 1 TB of data (see the sketch after this list)
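
As an illustrative sketch only, writing a partitioned Delta table might look like this; `events_df`, the partition column, and the table name are placeholders, not part of the original answer:

```python
# Hypothetical example: write a large table partitioned by a low-cardinality date column.
(events_df.write
    .format("delta")
    .partitionBy("event_date")         # pick a column that yields few, large partitions (>= 1 GB each)
    .mode("overwrite")
    .saveAsTable("analytics.events"))  # placeholder schema and table name
```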
- Regularly Run VACUUM:
• Reduces excess cloud data storage costs
• Default retention threshold is seven days
• Code:
deltaTable.vacuum()
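
A minimal sketch, reusing the placeholder table name from above; the 168 hours shown is simply the seven-day default written out explicitly:

```python
from delta.tables import DeltaTable

delta_table = DeltaTable.forName(spark, "my_table")  # placeholder table name

# Delete data files that are no longer referenced and are older than the retention threshold.
delta_table.vacuum()     # default 7-day retention
delta_table.vacuum(168)  # equivalent, with the retention given explicitly in hours
```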
- Use OPTIMIZE Command:
• Compacts small data files for better query performance
• Run daily as a starting point, then adjust the frequency to balance cost and performance
• Code:
deltaTable.optimize().executeCompaction()
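
A short sketch of compaction plus optional Z-ordering; the table and column names are placeholders:

```python
from delta.tables import DeltaTable

delta_table = DeltaTable.forName(spark, "my_table")  # placeholder table name

# Compact small files into larger ones.
delta_table.optimize().executeCompaction()

# Optionally co-locate data on a frequently filtered column while compacting.
delta_table.optimize().executeZOrderBy("event_date")  # placeholder column
```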
- Use Clustering:
• For clustered tables with many updates or inserts, schedule an OPTIMIZE job every one or two hours (see the sketch below)
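
As a sketch only, assuming your Databricks Runtime supports liquid clustering (`CLUSTER BY`); the schema, table, and column names are invented for illustration:

```python
# Create a Delta table clustered on a commonly filtered column.
spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.events_clustered (
        event_id   STRING,
        event_date DATE,
        payload    STRING
    )
    CLUSTER BY (event_date)
""")

# Clustering is applied incrementally when OPTIMIZE runs, so schedule this
# on the one-to-two-hour cadence mentioned above for heavily updated tables.
spark.sql("OPTIMIZE analytics.events_clustered")
```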
Sources:
- [Docs: history](https://docs.databricks.com/delta/history.html)
- [Docs: partitions](https://docs.databricks.com/tables/partitions.html)
- [Docs: vacuum](https://docs.databricks.com/delta/vacuum.html)
- [Docs: optimize](https://docs.databricks.com/delta/optimize.html)
- [Docs: clustering](https://docs.databricks.com/delta/clustering.html)