I see that Delta Lake has an OPTIMIZE command and also table properties for Auto Optimize. What are the differences between these and when should I use one over the other?
I am running jobs on Databricks using the Runs Submit API from Airflow. I have noticed that, on rare occasions, a single submitted run executes more than once concurrently. Why?
Databricks has a special DBIO commit protocol that uses _started and _committed marker files to write transactionally to cloud storage.
You can disable it by setting the following Spark config:
spark.conf.set("spark.sql.sources.commitProtocolClass", "org.apache.s...
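As a sketch of the mechanics only (it assumes a live SparkSession named `spark` on Databricks; the replacement class name is elided in the original answer, so a placeholder is used below):

```python
# Sketch only: assumes an existing SparkSession named `spark`.
# Inspect the commit protocol class currently in effect.
print(spark.conf.get("spark.sql.sources.commitProtocolClass"))

# Override it for this session. The value is elided in the original
# answer; "<commit-protocol-class>" is a placeholder, not a real class.
spark.conf.set(
    "spark.sql.sources.commitProtocolClass",
    "<commit-protocol-class>",
)
```

Note this is a session-level config, so it only affects writes issued after the `set` call in that session.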
For use cases where you need to authenticate to external cloud services, I recommend using Unity Catalog service credentials. These work with both serverless and classic compute in Databricks.
You'd create a service credential, and t...
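A minimal sketch of the AWS flavor, assuming a Databricks notebook context where `dbutils` is available and a service credential has already been created; the credential name, region, and S3 call below are all placeholders, not values from the original answer:

```python
import boto3

# Assumes a Databricks notebook where `dbutils` is defined and a Unity
# Catalog service credential named "my-credential" exists (placeholder name).
provider = dbutils.credentials.getServiceCredentialsProvider("my-credential")

# Build a boto3 session that signs requests using the service credential.
session = boto3.Session(botocore_session=provider, region_name="us-east-1")
s3 = session.client("s3")
print(s3.list_buckets())
```

The point of the pattern is that no long-lived keys appear in code; access is governed through Unity Catalog grants on the credential object.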
We do not recommend using spot instances with distributed ML training workloads that use barrier mode, such as TorchDistributor, as these workloads are extremely sensitive to executor loss. Please disable spot/preemptible instances and try again.
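As an illustration of how one might pin such a cluster to on-demand capacity on AWS (the runtime version, instance type, and node counts below are placeholders), the new_cluster block of a runs/submit payload can set aws_attributes accordingly:

```python
# Hypothetical new_cluster spec for the Jobs runs/submit API; all sizing
# values are placeholders. On AWS, availability=ON_DEMAND plus a
# first_on_demand count covering every node keeps spot capacity out.
new_cluster = {
    "spark_version": "<runtime-version>",
    "node_type_id": "<instance-type>",
    "num_workers": 8,
    "aws_attributes": {
        "availability": "ON_DEMAND",  # no spot instances at all
        "first_on_demand": 9,         # driver + 8 workers, all on-demand
    },
}
```

On Azure the analogous knob is azure_attributes, and on GCP gcp_attributes; the principle is the same: barrier-mode training should run entirely on non-preemptible nodes.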
We have customers reading millions of files per hour with Databricks Auto Loader. For high-volume use cases, we recommend enabling file notification mode, which, instead of continuously performing list operations on the filesystem, uses cloud nat...
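A sketch of what enabling file notification mode looks like on the stream source, assuming a SparkSession named `spark` on Databricks; the source path, schema location, and file format below are placeholders:

```python
# Sketch only: assumes a SparkSession `spark` on Databricks; the paths and
# format are placeholders. The key option is cloudFiles.useNotifications,
# which switches Auto Loader from directory listing to notification mode.
stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useNotifications", "true")
    .option("cloudFiles.schemaLocation", "<schema-location-path>")
    .load("<source-path>")
)
```

In notification mode Auto Loader discovers new files from queued cloud events rather than by re-listing the source directory, which is what makes it scale to very high file arrival rates.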