
Intermittent "Unexpected failure while waiting for the cluster to be ready" Error

108387
New Contributor II

We are encountering an occasional issue where jobs may fail with a message like the following:

Run result unavailable: job failed with error message Unexpected failure while waiting for the cluster (ID) to be ready.Cause Unexpected state for cluster (ID): BOOTSTRAP_TIMEOUT(SUCCESS): databricks_error_message:[id: InstanceId(ID), status: INSTANCE_INITIALIZING, workerEnvId:WorkerEnvId(ID), lastStatusChangeTime: 1651979481336, groupIdOpt None,requestIdOpt Some(ID),version 0] with threshold 700 seconds timed out after 700186 milliseconds. Please check network connectivity from the data plane to the control plane.,instance_id:ID
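For readers trying to interpret the message above, here is a minimal Python sketch that pulls out its timing fields. It assumes the message keeps the wording quoted in this post; the function name `parse_bootstrap_timeout` and the regular expressions are illustrative and not part of any Databricks tooling.

```python
import re
from datetime import datetime, timedelta, timezone

# Excerpt of the message quoted above (IDs were already redacted in the post).
MESSAGE = (
    "status: INSTANCE_INITIALIZING, workerEnvId:WorkerEnvId(ID), "
    "lastStatusChangeTime: 1651979481336, groupIdOpt None,"
    "requestIdOpt Some(ID),version 0] with threshold 700 seconds "
    "timed out after 700186 milliseconds."
)


def parse_bootstrap_timeout(message):
    """Extract the instance status and timing fields from the error text.

    Hypothetical helper: the patterns assume the wording shown in this thread.
    """
    status = re.search(r"status: (\w+)", message)
    changed = re.search(r"lastStatusChangeTime: (\d+)", message)
    timing = re.search(
        r"threshold (\d+) seconds timed out after (\d+) milliseconds", message
    )
    if not (status and changed and timing):
        raise ValueError("message does not match the expected pattern")

    # lastStatusChangeTime is treated as epoch milliseconds (an assumption
    # based on its magnitude); integer math avoids float rounding.
    epoch_ms = int(changed.group(1))
    changed_at = datetime.fromtimestamp(
        epoch_ms // 1000, tz=timezone.utc
    ) + timedelta(milliseconds=epoch_ms % 1000)

    return {
        "status": status.group(1),
        "last_status_change_utc": changed_at,
        "threshold_s": int(timing.group(1)),
        "waited_s": int(timing.group(2)) / 1000,
    }


info = parse_bootstrap_timeout(MESSAGE)
for key, value in info.items():
    print(f"{key}: {value}")
```

Read this way, the message says that the instance was still reported as `INSTANCE_INITIALIZING` when the 700-second threshold elapsed. The converted timestamp can then be lined up against the EC2 logs for the same instance.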

We have seen the related post https://community.databricks.com/s/question/0D53f00001fR8LGCA0/problem-with-spinning-up-a-cluster-on..., but unlike that issue, ours occurs in fewer than 5% of runs. Any job may fail this way, and the failures do not happen at any consistent time.

We were able to pull AWS EC2 logs for both failed and successful runs, but there are no obvious errors or differences between the two. For instance, the failed runs still appear to bootstrap correctly and connect to Databricks.

If it helps, all jobs are set up through dbx using the same cluster settings (spot with fallback).

What exactly does this error mean, and how might we go about addressing it?

2 REPLIES

Kaniz
Community Manager

Hi @Benjamin Niedzielski, this article describes several scenarios in which a cluster fails to launch and provides troubleshooting steps for each scenario, based on the error messages found in the logs.

108387
New Contributor II

Thank you for your response! We are on AWS rather than Azure, and (as a result?) the error messages do not seem to match those in the article you provided. We have tried some of the suggestions anyway, such as removing external Maven libraries, to no avail.

Switching to pools has reduced the issue considerably, but when new clusters are needed because the pool clusters are already in use, we still occasionally receive the original error.
