I see that Delta Lake has an OPTIMIZE command and also table properties for Auto Optimize. What are the differences between these and when should I use one over the other?
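For reference, here is a minimal sketch of the two features being compared, assuming a hypothetical Delta table named events on a Databricks runtime (the table name and property values are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Manual, on-demand file compaction of a Delta table.
spark.sql("OPTIMIZE events")

# Auto Optimize, enabled declaratively via table properties.
spark.sql("""
    ALTER TABLE events SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact'   = 'true'
    )
""")
```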
I am running jobs on Databricks using the Runs Submit API with Airflow. I have noticed that, on rare occasions, a single submitted run is executed more than once at the same time. Why does this happen?
For these scenarios, you can use schema evolution capabilities such as mergeSchema, or opt for the new VariantType to avoid requiring a schema at ingest time.
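As a rough sketch of the mergeSchema option (the path /tmp/events and the column names are hypothetical; this assumes a Delta-enabled Spark session):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Incoming batch carries a "device" column the target table does not have yet.
incoming = spark.createDataFrame(
    [(1, "click", "mobile")], ["id", "event", "device"]
)

# mergeSchema lets the write evolve the table schema (adding the new
# column) instead of failing on the mismatch.
(incoming.write
    .format("delta")
    .option("mergeSchema", "true")
    .mode("append")
    .save("/tmp/events"))
```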
For this style of ETL, there are two methods.
The first method, strictly for partitioned tables, is Dynamic Partition Overwrites, which requires a Spark configuration to be set and detects which partitions are to be overwritten by scanning the input...
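A minimal sketch of this first method, assuming Delta Lake 2.0+ or a recent Databricks runtime; the path /tmp/sales and the date/amount columns are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# With "dynamic" mode, an overwrite replaces only the partitions present
# in the incoming data; all other partitions are left untouched.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

updates = spark.createDataFrame([("2024-01-02", 42)], ["date", "amount"])

(updates.write
    .format("delta")
    .mode("overwrite")
    .partitionBy("date")
    .save("/tmp/sales"))
```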
At this time, Z-order columns must be specified in the asset definition; the property is pipelines.autoOptimize.zOrderCols. This may change in the future with Predictive Optimization.
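A minimal sketch of how this property attaches to an asset definition in a Delta Live Tables pipeline (the table name, the source view events_raw, and the Z-order columns are hypothetical):

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(
    table_properties={
        # Z-order columns are declared on the asset itself.
        "pipelines.autoOptimize.zOrderCols": "event_date,user_id"
    }
)
def events():
    # "events_raw" stands in for whatever upstream dataset you read from.
    return dlt.read("events_raw").withColumn("ingested_at", F.current_timestamp())
```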
Please try partition discovery for external tables. It should allow you to run the MSCK REPAIR TABLE command successfully and, more importantly, to query external Parquet tables more efficiently.
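Once partition discovery is enabled, the flow looks roughly like this (the table name sales_ext and the partition column sale_date are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Register partitions that exist on storage but are not yet in the metastore.
spark.sql("MSCK REPAIR TABLE sales_ext")

# Partition pruning now kicks in for filters on the partition column.
spark.sql("SELECT * FROM sales_ext WHERE sale_date = '2024-01-02'").show()
```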