12-05-2022 03:04 AM
Good morning, and thank you for the support.
In our scheduled job, one cluster failed to start with the following error:
```
Run result unavailable: job failed with error message
Unexpected failure while waiting for the cluster to be ready.Cause Unexpected state for cluster: INVALID_ARGUMENT(CLIENT_ERROR): databricks_error_message: Container setup failed because of an invalid request: Spark image "release__10.4.x-snapshot-scala2.12__databricks-universe__head__dab7230__ee00e81__jenkins__9b44ccb__format-2" failed to download or does not exist.
```
This job and related configuration have worked for the past month, as well as for the runs in the days after we received this error.
In the job several clusters are spun up, and on the day of the failure only one had the error.
Where could we find more information or access some log files?
Is there a way to automatically retry creating the cluster? Is setting the `Task retry policy` sufficient to work around this error, or would a retry simply find the cluster in an error state?
I have looked through https://learn.microsoft.com/en-us/azure/databricks/kb/clusters/termination-reasons but could not find a related issue.
Cheers
12-05-2022 05:07 AM
@Pietro Maria Nobili
You can use the Task retry policy; it will start the job cluster again, since the scope of a job cluster ends when the task completes or fails.
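Beyond the UI, the same setting can be applied programmatically. A minimal sketch, assuming the Jobs API 2.1; the job id, retry count, and interval below are placeholders, not recommendations:
```python
import os
import requests

# Placeholder workspace credentials -- adjust to your environment.
HOST = os.environ["DATABRICKS_HOST"]   # e.g. https://adb-<workspace-id>.azuredatabricks.net
TOKEN = os.environ["DATABRICKS_TOKEN"]
HEADERS = {"Authorization": f"Bearer {TOKEN}"}
JOB_ID = 123  # placeholder job id

# Retry settings live on each task of the job definition. With max_retries > 0,
# a failed task run is retried on a freshly created job cluster, so a transient
# "Spark image ... failed to download" error can clear itself on the next attempt.
job = requests.get(f"{HOST}/api/2.1/jobs/get", headers=HEADERS, params={"job_id": JOB_ID}).json()
settings = job["settings"]

for task in settings.get("tasks", []):
    task["max_retries"] = 2                      # retry each failed task up to twice
    task["min_retry_interval_millis"] = 600_000  # wait 10 minutes between attempts
    task["retry_on_timeout"] = False

# jobs/reset replaces the job settings wholesale, so send back everything that was read.
resp = requests.post(
    f"{HOST}/api/2.1/jobs/reset",
    headers=HEADERS,
    json={"job_id": JOB_ID, "new_settings": settings},
)
resp.raise_for_status()
```
Because the retry spins up a brand-new job cluster rather than reusing the failed one, the retry should not find the old cluster in an error state.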
12-09-2022 12:57 AM
We're experiencing the same issue on our production environment, with pretty much the same error: one job on 9.1 and one on 11.3 runtime versions, both LTS. The pipelines do recover on subsequent runs, so it looks like an intermittent issue, possibly on Databricks' side. Would be good to get an answer.
12-12-2022 12:00 AM
Hi!
We encounter exactly the same issue, using 10.4 LTS image on Azure EU.
Our workflows start between 5 and 9 AM CET, and multiple of them fail each day.
A simple retry solves the issue, but it is very frustrating; please give us an update about this.
01-16-2023 11:27 PM
Similarly here, on Azure, on 10.4
01-25-2023 12:22 AM
Same issue on my end; it seems to be a wider problem?
Azure, 10.4 LTS ML runtime... @Jose Gonzalez, @Landan George, or anyone, please follow up on this topic. In my case, the issue appears on the PROD env.
Has anyone solved this issue?
01-25-2023 02:14 AM
Same issue here: the workflow has been working without any problems, and this morning we got the error:
```
Run result unavailable: job failed with error message
Unexpected failure while waiting for the cluster (--------------) to be ready.Cause Unexpected state for cluster (----------------: INVALID_ARGUMENT(CLIENT_ERROR): databricks_error_message:Container setup failed because of an invalid request: Spark image "release__9.1.x-snapshot-scala2.12__databricks-universe__head__9e5c85c__96835dd__jenkins__140ce8f__format-2" failed to download or does not exist.
```
01-25-2023 08:00 AM
@Rita Fernandes @Kajetan Gęgotek @Yoshi Coppens @Viktor Fulop @Andrius Vitkauskas @Pietro Maria Nobili
It looks like an issue with Azure account limits; the Databricks eng team is looking into it. Apart from retries, I'd suggest running jobs not on the hour: instead of running a job at 1:00 AM, run it at 1:17 AM (for example), which should help.
If I hear more I'll respond to this thread.
Thanks @Kajetan Gęgotek for tagging me and bringing this to my attention
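To follow the off-the-hour suggestion above, here is a minimal sketch of shifting a daily schedule from 1:00 AM to 1:17 AM via the Jobs API 2.1; the job id and timezone are placeholders, and Databricks schedules use Quartz cron syntax:
```python
import os
import requests

HOST = os.environ["DATABRICKS_HOST"]
TOKEN = os.environ["DATABRICKS_TOKEN"]
HEADERS = {"Authorization": f"Bearer {TOKEN}"}
JOB_ID = 123  # placeholder job id

# Quartz cron fields: seconds, minutes, hours, day-of-month, month, day-of-week.
payload = {
    "job_id": JOB_ID,
    "new_settings": {
        "schedule": {
            "quartz_cron_expression": "0 17 1 * * ?",  # 01:17 every day
            "timezone_id": "Europe/Paris",             # CET/CEST, as in the reports above
            "pause_status": "UNPAUSED",
        }
    },
}
resp = requests.post(f"{HOST}/api/2.1/jobs/update", headers=HEADERS, json=payload)
resp.raise_for_status()
```
jobs/update only touches the top-level fields passed in new_settings, so the job's tasks and clusters are left as they are.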