Hi!
We have been using Databricks on Azure in production for about three months. A big part of what we use Databricks for is processing data with a workflow of various Python notebooks. We run the workflow on a 'Pools' cluster and on 'All-purpose compute'. All compute uses Databricks Runtime 13.3 LTS.
For about three weeks we have been seeing regular task failures in our pipeline that all seem to be caused by technical, non-reproducible errors. In most cases repairing the failed task makes the run succeed, but our trust in the platform has obviously taken a beating because of this.
Some of the problems we regularly see:
- Cluster 'xxx' was terminated. Reason: COMMUNICATION_LOST (CLOUD_FAILURE)
- Failed to acquire a SAS token for list on /__unitystorage/catalogs/xxx/tables/xxx/_delta_log due to java.util.concurrent.ExecutionException: org.apache.spark.sql.AnalysisException: 403: Invalid Authorization
- Fatal error: The Python kernel is unresponsive.
- run failed with error message Cluster xxx became unusable during the run since the driver became unhealthy
- com.databricks.common.client.DatabricksServiceHttpClientException: 403: Invalid Authorization
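Since repairs almost always succeed, one stopgap we are considering is configuring automatic task retries on the workflow. Below is a minimal sketch using the Databricks Python SDK (databricks-sdk); the job name, notebook path, and cluster id are placeholders, and the retry values are only examples we have not yet validated against these failures.

```python
# Sketch: task-level automatic retries via the Databricks Python SDK.
# Job name, notebook path, and cluster id below are placeholders.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()  # picks up auth from environment / .databrickscfg

w.jobs.create(
    name="example-workflow",
    tasks=[
        jobs.Task(
            task_key="process_data",
            notebook_task=jobs.NotebookTask(notebook_path="/Workspace/etl/process_data"),
            existing_cluster_id="<cluster-id>",
            max_retries=2,                      # retry the task up to twice on failure
            min_retry_interval_millis=300_000,  # wait 5 minutes between attempts
            retry_on_timeout=False,
        )
    ],
)
```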
Our Databricks instance is hosted in Azure West Europe.
Does anybody have similar experiences? If so, did you find a way to improve stability?