I see that Delta Lake has an OPTIMIZE command and also table properties for Auto Optimize. What are the differences between these, and when should I use one over the other?
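For context, here is a minimal sketch contrasting the two mechanisms the question names (the table and column names are placeholders): OPTIMIZE is an on-demand command you run against a table, while auto optimize is enabled once via table properties and applies automatically on subsequent writes.

```python
# Placeholder table/column names for illustration.

# On-demand compaction, optionally co-locating data with ZORDER:
spark.sql("OPTIMIZE main.sales.events ZORDER BY (event_date)")

# Auto optimize: set once as table properties; applied on every write.
spark.sql("""
  ALTER TABLE main.sales.events SET TBLPROPERTIES (
    'delta.autoOptimize.optimizeWrite' = 'true',
    'delta.autoOptimize.autoCompact'   = 'true'
  )
""")
```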
I am running jobs on Databricks using the Run Submit API with Airflow. I have noticed that, on rare occasions, the same run is launched more than once at the same time. Why?
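One commonly recommended safeguard (an assumption here, since the root cause isn't stated above) is to pass an `idempotency_token` with each `runs/submit` call, so that a retried submission of the same logical task returns the existing run instead of starting a second one. A sketch with placeholder host, token, and notebook path:

```python
import requests

# Placeholders; not real credentials or paths.
HOST = "https://<workspace>.cloud.databricks.com"
TOKEN = "<api-token>"

payload = {
    "run_name": "airflow-task-run",
    # Submissions sharing a token are deduplicated by the Jobs API, so an
    # Airflow retry of the same task attempt can't spawn a second run.
    "idempotency_token": "dag_id__task_id__2024-01-01",
    "tasks": [{
        "task_key": "main",
        "notebook_task": {"notebook_path": "/Repos/jobs/main"},
        "new_cluster": {
            "spark_version": "14.3.x-scala2.12",
            "node_type_id": "i3.xlarge",
            "num_workers": 2,
        },
    }],
}

resp = requests.post(
    f"{HOST}/api/2.1/jobs/runs/submit",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
)
resp.raise_for_status()
print(resp.json()["run_id"])
```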
At this time, Z-order columns must be specified in the asset definition via the pipelines.autoOptimize.zOrderCols table property. This may change in the future with Predictive Optimization.
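For illustration, a minimal Delta Live Tables sketch setting that property (the table name, source, and Z-order columns are placeholders):

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(
    name="events_optimized",  # placeholder name
    table_properties={
        # Columns to Z-order during auto optimization; placeholders here.
        "pipelines.autoOptimize.zOrderCols": "event_date,user_id"
    },
)
def events_optimized():
    return dlt.read_stream("events_raw").withColumn(
        "ingested_at", F.current_timestamp()
    )
```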
Please try partition discovery for external tables. This feature should allow you to run the MSCK REPAIR TABLE command successfully and, more importantly, query external Parquet tables more efficiently.
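As a sketch of the flow (the catalog, schema, table, path, and partition column are all assumptions), you'd register an external partitioned Parquet table and then sync its partition metadata:

```python
# Placeholder names and paths throughout.
spark.sql("""
  CREATE TABLE IF NOT EXISTS main.analytics.ext_events (
    id BIGINT, payload STRING, dt STRING
  )
  USING PARQUET
  PARTITIONED BY (dt)
  LOCATION 's3://my-bucket/ext_events/'
""")

# Discover partitions already present under the table location.
spark.sql("MSCK REPAIR TABLE main.analytics.ext_events")

# Partition pruning should now apply to partition-column filters.
spark.sql(
    "SELECT count(*) FROM main.analytics.ext_events WHERE dt = '2024-01-01'"
).show()
```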
Please make sure you are using a Dedicated (single-user) cluster when authenticating to the file notification service, that is, when authenticating to SQS via an instance profile. This will likely change in the future, so stay posted.
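For reference, a minimal Auto Loader stream in file notification mode (bucket paths and target table are placeholders); this is the read that relies on the cluster's instance profile having the SQS permissions:

```python
# Placeholder paths; run on a Dedicated (single-user) cluster per the note above.
stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    # File notification mode: on AWS, Auto Loader uses SNS + SQS,
    # authenticating via the cluster's instance profile.
    .option("cloudFiles.useNotifications", "true")
    .load("s3://my-bucket/landing/")
)

(stream.writeStream
    .option("checkpointLocation", "s3://my-bucket/_checkpoints/landing")
    .toTable("main.bronze.landing_events"))
```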
Auto Loader's scope is limited to incrementally loading files from storage; there is no built-in functionality to load only the latest file from a group of files. You'd likely want to put this kind of "last updated" logic in a different layer or in ...
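As one way to put that logic in a downstream layer (table names, paths, and the `entity_id` key are assumptions), you could ingest every file with Auto Loader, capture file metadata via the `_metadata` column, and keep only the row from the newest file per key in a batch step:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Bronze: ingest every file, keeping its path and modification time.
bronze = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    .load("s3://my-bucket/drops/")
    .withColumn("source_file", F.col("_metadata.file_path"))
    .withColumn("file_mtime", F.col("_metadata.file_modification_time"))
)

(bronze.writeStream
    .option("checkpointLocation", "s3://my-bucket/_checkpoints/drops")
    .toTable("main.bronze.drops"))

# Silver (batch): per entity, keep only the row from the most recent file.
w = Window.partitionBy("entity_id").orderBy(F.col("file_mtime").desc())
latest = (
    spark.table("main.bronze.drops")
    .withColumn("rn", F.row_number().over(w))
    .filter("rn = 1")
    .drop("rn")
)
latest.write.mode("overwrite").saveAsTable("main.silver.latest_drops")
```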