Cluster Upsize Issue: Storage Download Failure Slow

sdick_vg — Wed, 25 Sep 2024 19:18:54 GMT

Hi,

We're currently seeing the following issue across our entire Databricks Workspace whenever we start a cluster, run a workflow, or upscale a running cluster. These are the errors we receive on our AP clusters and job clusters:

Compute upsize complete, but below target size. The current worker count is 6, out of a target of 8. Reason: Storage Download Failure Slow

Cluster '0925-190009-qlelyoz' was terminated. Reason: STORAGE_DOWNLOAD_FAILURE_SLOW (CLIENT_ERROR). Parameters: databricks_error_message:Downloading worker artifacts onto the instance timed out.

As a result, workflows fail and AP clusters cannot acquire additional resources. I haven't seen any similar reports in the community and was wondering how we can go about troubleshooting this.

Thank you,

Re: Cluster Upsize Issue: Storage Download Failure Slow

filipniziol — Wed, 25 Sep 2024 20:08:03 GMT

Hi @sdick_vg ,

The error points to connectivity problems when trying to reach Azure Storage.
Have you maybe enabled any kind of firewall in your organization recently?

Could you run, for example, some code to test DNS resolution for your storage account?
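Something along these lines could work as a quick check from a notebook cell. This is only a minimal sketch and not an official Databricks diagnostic: the function name `resolve_host` and the `resolver` parameter are my own, and the hostname you pass in is an assumption you need to replace with your actual storage endpoint.

```python
import socket


def resolve_host(hostname, port=443, resolver=socket.getaddrinfo):
    """Return the sorted unique IPs that `hostname` resolves to.

    Returns an empty list if DNS resolution fails. `resolver` is a
    hypothetical hook (defaulting to socket.getaddrinfo) so the lookup
    can be swapped out or tested without network access.
    """
    try:
        infos = resolver(hostname, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror as exc:
        # gaierror is what getaddrinfo raises when the name cannot be resolved
        print(f"DNS resolution failed for {hostname}: {exc}")
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})
```

You would call it as `resolve_host("<storage-account-name>.blob.core.windows.net")`, with the placeholder swapped for your own account. An empty list suggests a DNS problem. A list of addresses means resolution works, so the next thing to look at would be whether a firewall or network rule blocks the traffic itself.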

Have you made any changes to the VNet where the Databricks storage account is located?
