I see that Delta Lake has an OPTIMIZE command and also table properties for Auto Optimize. What are the differences between these and when should I use one over the other?
I am running jobs on Databricks using the Runs Submit API with Airflow. I have noticed that, on rare occasions, the same run is executed more than once at the same time. Why?
No, you will still be able to time travel to versions from before the OPTIMIZE command. OPTIMIZE is just another transaction, like MERGE or UPDATE. Check out these docs to learn more about retention periods and the VACUUM command.
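To make the time travel point concrete, here is a minimal sketch that only builds the SQL strings involved. The helper names, the table name `events`, and the version number `3` are hypothetical and chosen for illustration; `DESCRIBE HISTORY` and `VERSION AS OF` are the Delta Lake SQL forms I am assuming you would run.

```python
def history_query(table: str) -> str:
    """Build a statement listing the table's commits; an OPTIMIZE run appears as one of them."""
    return f"DESCRIBE HISTORY {table}"


def time_travel_query(table: str, version: int) -> str:
    """Build a query that reads the table as of an earlier version number."""
    if version < 0:
        raise ValueError("version must be non-negative")
    return f"SELECT * FROM {table} VERSION AS OF {version}"


# Hypothetical table name and version number, for illustration only.
print(history_query("events"))
print(time_travel_query("events", 3))
```

On a cluster, each string could be passed to `spark.sql(...)`. Whether an old version is still readable depends on the retention settings and on whether VACUUM has already removed the underlying files.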
The OPTIMIZE command is a SQL command that can be run on a schedule or ad hoc. It compacts small files into larger ones. Additionally, you can specify predicates to run the command on only a subset of a table, and also specify that you want to ...
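As a rough sketch of the two ways to call it (whole table versus a subset), the snippet below builds the statement text. The helper `optimize_statement`, the table name `events`, and the column `date` are hypothetical; as far as I know, the predicate has to filter on partition columns, so this assumes `date` is one.

```python
from typing import Optional


def optimize_statement(table: str, predicate: Optional[str] = None) -> str:
    """Build an OPTIMIZE statement, optionally restricted to a subset of the table."""
    sql = f"OPTIMIZE {table}"
    if predicate:
        # Assumption: the predicate references partition columns only.
        sql += f" WHERE {predicate}"
    return sql


# Hypothetical table and partition column, for illustration only.
print(optimize_statement("events"))
print(optimize_statement("events", "date >= '2021-01-01'"))
```

The resulting string could be run from a scheduled job or a notebook cell with `spark.sql(...)`, which covers both the regular and the ad hoc case.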