
Bootstrap Timeout during cluster start on AWS cloud

rpaschenko
New Contributor II

Hi!

We had a bunch of strange job failures on September 28-29.

Some job runs could not start for a while (30-50 minutes) and then failed with this error:

Unexpected failure while waiting for the cluster (0929-002141-2zkekhdj) to be ready: Cluster 0929-002141-2zkekhdj is in unexpected state Terminated: BOOTSTRAP_TIMEOUT(SUCCESS): databricks_error_message:[id: InstanceId(i-0fc5420c47a8ec703), status: INSTANCE_INITIALIZING, workerEnvId:WorkerEnvId(workerenv-4325872309166545-0cd76812-3736-4d1f-aec9-e5663c7cfd13), lastStatusChangeTime: 1695946919780, groupIdOpt Some(0),requestIdOpt Some(0929-002141-2zkekhdj-ccdc6648-fd47-4587-a),version 1] with threshold 700 seconds timed out after 707904 milliseconds. Please check network connectivity from the data plane to the control plane.,instance_id:i-0fc5420c47a8ec703.

Some job runs also failed with this event:

Failed to add 16 containers to the compute. Will attempt retry: true. Reason: Container launch failure

 

1 REPLY

Kaniz_Fatma
Community Manager

Hi @rpaschenko, the failures you experienced on September 28-29 could have several causes.

For the jobs that could not start for a while and then failed, the error shows that the instance was still in the INSTANCE_INITIALIZING state when the 700-second bootstrap threshold expired. This could be due to network connectivity issues between the data plane and the control plane, as the error message itself suggests. It's also possible that the instance was terminated unexpectedly. You might want to check the network connectivity and the instance's status.
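To make the timing in that error easier to read and to line it up with AWS-side events, here is a minimal Python sketch that pulls the key fields out of the termination message. The function name `summarize_bootstrap_timeout` and the regular expressions are my own assumptions based on the one message quoted in the question; the message format is not a documented contract and may differ in other cases.

```python
import re
from datetime import datetime, timezone

# Termination message copied verbatim from the failed run in the question.
MESSAGE = (
    "Unexpected failure while waiting for the cluster (0929-002141-2zkekhdj) to be ready: "
    "Cluster 0929-002141-2zkekhdj is in unexpected state Terminated: BOOTSTRAP_TIMEOUT(SUCCESS): "
    "databricks_error_message:[id: InstanceId(i-0fc5420c47a8ec703), status: INSTANCE_INITIALIZING, "
    "workerEnvId:WorkerEnvId(workerenv-4325872309166545-0cd76812-3736-4d1f-aec9-e5663c7cfd13), "
    "lastStatusChangeTime: 1695946919780, groupIdOpt Some(0),"
    "requestIdOpt Some(0929-002141-2zkekhdj-ccdc6648-fd47-4587-a),version 1] "
    "with threshold 700 seconds timed out after 707904 milliseconds. "
    "Please check network connectivity from the data plane to the control plane.,"
    "instance_id:i-0fc5420c47a8ec703."
)

# Assumed patterns, inferred from the single message above.
PATTERNS = {
    "instance_id": r"InstanceId\((i-[0-9a-f]+)\)",
    "status": r"status: (\w+)",
    "last_change_ms": r"lastStatusChangeTime: (\d+)",
    "threshold_s": r"threshold (\d+) seconds",
    "elapsed_ms": r"timed out after (\d+) milliseconds",
}


def summarize_bootstrap_timeout(message: str) -> dict:
    """Extract instance id, status, and timing fields; missing fields become None."""
    raw = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, message)
        raw[name] = match.group(1) if match else None

    summary = {"instance_id": raw["instance_id"], "status": raw["status"]}
    if raw["threshold_s"] is not None:
        summary["threshold_s"] = int(raw["threshold_s"])
    if raw["elapsed_ms"] is not None:
        summary["elapsed_s"] = int(raw["elapsed_ms"]) / 1000
    if raw["last_change_ms"] is not None:
        # The value looks like epoch milliseconds; convert it to a UTC timestamp.
        summary["last_change_utc"] = datetime.fromtimestamp(
            int(raw["last_change_ms"]) / 1000, tz=timezone.utc
        )
    return summary


summary = summarize_bootstrap_timeout(MESSAGE)
for key, value in summary.items():
    print(f"{key}: {value}")
```

The instance ID and the UTC timestamp are the two values worth searching for on the AWS side when looking for what happened to that instance around the time of the failure.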

As for the jobs that failed with the event "Failed to add 16 containers to the compute", the cause is a container launch failure. This can happen for various reasons, such as resource constraints, network issues, or problems with the container image.

To troubleshoot these issues, you might want to:

- Review the logs for more details about the errors.
- Check the network connectivity between the data plane and the control plane.
- Check the status of the instances and containers.
- Review the job configurations and the resources allocated to the jobs.
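
For the network connectivity step, a rough first check is to attempt a TCP connection from a machine inside the same VPC and subnet as the cluster nodes. The sketch below is only an illustration: the helper name `can_reach` is my own, and the hostname in the comment is a hypothetical placeholder to replace with your workspace hostname. The full set of control plane endpoints depends on your region and deployment, so a successful check here does not rule out a network problem.

```python
import socket


def can_reach(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


# Example call (hypothetical hostname, replace with your own workspace host):
# print(can_reach("example.cloud.databricks.com", 443))
```

A `False` result from inside the data plane subnet points toward security groups, route tables, NAT or firewall rules; a `True` result only confirms that this one host and port are reachable.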
