Data Engineering

Bootstrap cluster timeout for job pipeline - databricks bug?

drag7ter
Contributor

From time to time we get these errors in scheduled PROD runs. It happens when a job starts and tries to create its one-time job cluster. It occurs in roughly 1 out of every 10-20 runs, and we have not been able to identify the root cause: all network connectivity is fine, and other jobs run fine at the same time. Why does it happen? Could this be a bug in Databricks during cluster creation?

The job's one-time cluster config:

job_clusters:
  - job_cluster_key: my_cluster
    new_cluster:
      cluster_name: ""
      spark_version: 15.4.x-scala2.12
      spark_conf:
        spark.databricks.cluster.profile: singleNode
        spark.master: local[*, 4]
      aws_attributes:
        first_on_demand: 1
        availability: SPOT_WITH_FALLBACK
        zone_id: eu-west-1b
        spot_bid_price_percent: 100
      node_type_id: m5d.4xlarge
      driver_node_type_id: m5d.4xlarge
      custom_tags:
        ResourceClass: SingleNode
      enable_elastic_disk: true
      data_security_mode: SINGLE_USER
      runtime_engine: STANDARD
      num_workers: 0

The error we get impacts our PROD runs, which is really annoying:

run failed with error message
 Cluster '0120-205753-51ldqtu1' was terminated. Reason: BOOTSTRAP_TIMEOUT (SERVICE_FAULT). Parameters: databricks_error_message:[id: InstanceId(i-0a8e2c9776c79e66d), status: INSTANCE_INITIALIZING, workerEnvId:WorkerEnvId(workerenv-3386680009775160-1371cfd7-90c5-4a02-84fd-eedf9d7fa269), lastStatusChangeTime: 1737406707396, groupIdOpt Some(0),requestIdOpt Some(0120-205753-51ldqtu1-a444ac40-a32d-4c63-b),version 1] with threshold 700 seconds timed out after 703368 milliseconds. Instance bootstrap inferred timeout reason: UnknownReason. Please check network connectivity from the data plane to the control plane., instance_id:i-0a8e2c9776c79e66d.

 

What does "UnknownReason" mean here?

Do you know how we could fix this? All network connectivity is fine.

1 REPLY

NandiniN
Databricks Employee

The error message "BOOTSTRAP_TIMEOUT (SERVICE_FAULT)" indicates that the cluster was terminated because the instance took too long to initialize: in your message it was still in INSTANCE_INITIALIZING when the 700-second threshold was exceeded. This can happen for various reasons, including network connectivity issues between the data plane and the control plane, or issues with the cloud provider's infrastructure.

Given the intermittent nature of the issue (1 in 10-20 runs), it might be challenging to pinpoint the exact cause. Monitoring the infrastructure and keeping track of when the errors occur can help identify any patterns or recurring issues.

I would suggest debugging as soon as the issue occurs: check all the available logs, and also check on the cloud provider side.
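
To make the tracking suggested above a bit more concrete, here is a minimal Python sketch that pulls the useful fields out of a termination message and counts failures per hour of day. It is only an illustration: `parse_termination` and `failures_by_hour` are hypothetical helper names, not Databricks APIs, and the regular expressions assume the message looks like the one posted in the question, which may not be a stable format.

```python
import re
from collections import Counter
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Assumption: these patterns mirror the message posted in the question;
# the format is not documented as stable and may differ in other cases.
PATTERNS = {
    "cluster_id": r"Cluster '([^']+)'",
    "reason": r"Reason: (\w+)",
    "instance_id": r"InstanceId\((i-[0-9a-f]+)\)",
    "status": r"status: (\w+)",
    "last_status_change_ms": r"lastStatusChangeTime: (\d+)",
    "threshold_s": r"threshold (\d+) seconds",
    "elapsed_ms": r"timed out after (\d+) milliseconds",
    "inferred_reason": r"inferred timeout reason: (\w+)",
}


def parse_termination(message):
    """Hypothetical helper: extract fields from a termination message."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, message)
        fields[name] = match.group(1) if match else None
    ms = fields["last_status_change_ms"]
    # Convert the epoch-millisecond timestamp to a readable UTC time.
    fields["last_status_change_utc"] = (
        (EPOCH + timedelta(milliseconds=int(ms))).isoformat() if ms else None
    )
    return fields


def failures_by_hour(messages):
    """Hypothetical helper: count parsed failures per UTC hour of day."""
    hours = Counter()
    for message in messages:
        stamp = parse_termination(message)["last_status_change_utc"]
        if stamp:
            hours[datetime.fromisoformat(stamp).hour] += 1
    return hours


# The message from the question, copied as posted.
SAMPLE = (
    "Cluster '0120-205753-51ldqtu1' was terminated. "
    "Reason: BOOTSTRAP_TIMEOUT (SERVICE_FAULT). Parameters: "
    "databricks_error_message:[id: InstanceId(i-0a8e2c9776c79e66d), "
    "status: INSTANCE_INITIALIZING, "
    "workerEnvId:WorkerEnvId(workerenv-3386680009775160-1371cfd7-90c5-4a02-84fd-eedf9d7fa269), "
    "lastStatusChangeTime: 1737406707396, groupIdOpt Some(0),"
    "requestIdOpt Some(0120-205753-51ldqtu1-a444ac40-a32d-4c63-b),version 1] "
    "with threshold 700 seconds timed out after 703368 milliseconds. "
    "Instance bootstrap inferred timeout reason: UnknownReason. "
    "Please check network connectivity from the data plane to the control plane., "
    "instance_id:i-0a8e2c9776c79e66d."
)

if __name__ == "__main__":
    for key, value in parse_termination(SAMPLE).items():
        print(f"{key}: {value}")
```

Collecting these records for every failed run (instance ID, availability zone from the job config, time of day) gives you something specific to correlate with the cloud provider's logs for the same instance, and to attach to a support ticket if a pattern emerges.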
