07-31-2022 08:20 PM
Databricks gives us a great toolkit in the form of OPTIMIZE and VACUUM. But when it comes to operationalizing them, I am really confused about the best practice.
Should we enable optimized writes and auto compaction by setting the following at a workspace level?
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")  # optimized writes: write fewer, larger files
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")  # auto compaction: compact small files after a write
OR
Should we explicitly run the OPTIMIZE command on tables and databases at a set frequency? Also, if we enable optimized writes at a workspace level, do we still have to run OPTIMIZE separately at a table level? Are they the same or different?
Once the decision around OPTIMIZE is settled, when should we run VACUUM? Should we run both OPTIMIZE and VACUUM in the same script? If not, what should be the ideal order?
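To make the "same script" option concrete, here is a rough sketch of what I have in mind. It assumes OPTIMIZE runs before VACUUM, which is only one of the orderings I am asking about. The function names, the table name, and the 168-hour retention (the 7-day default) are placeholders of mine, not a recommended setup.

```python
from typing import Callable, Iterable, List


def maintenance_statements(tables: Iterable[str], retain_hours: int = 168) -> List[str]:
    """Build OPTIMIZE-then-VACUUM SQL for each table (hypothetical helper)."""
    statements: List[str] = []
    for table in tables:
        # Assumed ordering: compact small files first ...
        statements.append(f"OPTIMIZE {table}")
        # ... then delete unreferenced files older than the retention window.
        statements.append(f"VACUUM {table} RETAIN {retain_hours} HOURS")
    return statements


def run_maintenance(
    tables: Iterable[str],
    run_sql: Callable[[str], object],
    retain_hours: int = 168,
) -> None:
    """Execute each statement through a caller-supplied runner such as spark.sql."""
    for statement in maintenance_statements(tables, retain_hours):
        run_sql(statement)
```

In a scheduled notebook or job this would be called as run_maintenance(["my_db.my_table"], spark.sql), where my_db.my_table is a made-up table name and spark is the session Databricks provides.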