Unstable workflow runs lately

FrankTa · 06-28-2024

Hi!

We have been using Databricks on Azure in production for about 3 months. A big part of what we use Databricks for is processing data using a workflow with various Python notebooks. We run the workflow on a 'Pools' cluster and on an 'All-purpose compute'. All computes use Databricks Runtime Version 13.3 LTS.

For about 3 weeks we have been facing regular task failures in our pipeline that all seem to be due to technical, non-reproducible errors. In most cases a repair run of the task completes just fine, but obviously our trust in the platform has taken a beating because of this.

Some of the problems we regularly see:

Cluster 'xxx' was terminated. Reason: COMMUNICATION_LOST (CLOUD_FAILURE)
Failed to acquire a SAS token for list on /__unitystorage/catalogs/xxx/tables/xxx/_delta_log due to java.util.concurrent.ExecutionException: org.apache.spark.sql.AnalysisException: 403: Invalid Authorization
Fatal error: The Python kernel is unresponsive.
run failed with error message Cluster xxx became unusable during the run since the driver became unhealthy
com.databricks.common.client.DatabricksServiceHttpClientException: 403: Invalid Authorization

Our Databricks instance is hosted in Azure West Europe.

Does anybody have similar experiences? And if so, did you find a way to add more stability?