From time to time we get these errors in scheduled PROD runs. They occur when a job starts and tries to create its one-time cluster. It happens in roughly 1 out of every 10-20 runs, and we have not been able to identify the root cause: all network connectivity is fine, and other jobs run fine at the same time. Why does this happen? Could it be a bug in Databricks during cluster creation?
The job's one-time cluster config:
job_clusters:
  - job_cluster_key: my_cluster
    new_cluster:
      cluster_name: ""
      spark_version: 15.4.x-scala2.12
      spark_conf:
        spark.databricks.cluster.profile: singleNode
        spark.master: local[*, 4]
      aws_attributes:
        first_on_demand: 1
        availability: SPOT_WITH_FALLBACK
        zone_id: eu-west-1b
        spot_bid_price_percent: 100
      node_type_id: m5d.4xlarge
      driver_node_type_id: m5d.4xlarge
      custom_tags:
        ResourceClass: SingleNode
      enable_elastic_disk: true
      data_security_mode: SINGLE_USER
      runtime_engine: STANDARD
      num_workers: 0
The error we get impacts our PROD runs, which is really annoying:
run failed with error message
Cluster '0120-205753-51ldqtu1' was terminated. Reason: BOOTSTRAP_TIMEOUT (SERVICE_FAULT). Parameters: databricks_error_message:[id: InstanceId(i-0a8e2c9776c79e66d), status: INSTANCE_INITIALIZING, workerEnvId:WorkerEnvId(workerenv-3386680009775160-1371cfd7-90c5-4a02-84fd-eedf9d7fa269), lastStatusChangeTime: 1737406707396, groupIdOpt Some(0),requestIdOpt Some(0120-205753-51ldqtu1-a444ac40-a32d-4c63-b),version 1] with threshold 700 seconds timed out after 703368 milliseconds. Instance bootstrap inferred timeout reason: UnknownReason. Please check network connectivity from the data plane to the control plane., instance_id:i-0a8e2c9776c79e66d.
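To make the numbers in that message easier to read, here is a minimal Python sketch that pulls out the threshold, the elapsed time, and the last status change. It assumes that lastStatusChangeTime is epoch milliseconds in UTC (the message does not say so explicitly), and the function name parse_bootstrap_timeout is purely illustrative, not part of any Databricks API.

```python
import re
from datetime import datetime, timezone

# Fragments copied verbatim from the termination message above.
MESSAGE = (
    "status: INSTANCE_INITIALIZING, "
    "lastStatusChangeTime: 1737406707396, "
    "with threshold 700 seconds timed out after 703368 milliseconds."
)


def parse_bootstrap_timeout(msg):
    """Extract the timing fields from a BOOTSTRAP_TIMEOUT message (hypothetical helper)."""
    threshold_s = int(re.search(r"threshold (\d+) seconds", msg).group(1))
    elapsed_ms = int(re.search(r"timed out after (\d+) milliseconds", msg).group(1))
    changed_ms = int(re.search(r"lastStatusChangeTime: (\d+)", msg).group(1))
    # Assumption: the value is epoch milliseconds in UTC.
    changed_at = datetime.fromtimestamp(changed_ms // 1000, tz=timezone.utc)
    return {
        "threshold_s": threshold_s,
        "elapsed_ms": elapsed_ms,
        # How far past the threshold the instance was when it was given up on.
        "overshoot_ms": elapsed_ms - threshold_s * 1000,
        "last_status_change": changed_at,
    }


info = parse_bootstrap_timeout(MESSAGE)
print(info["overshoot_ms"], "ms over the threshold")
print("last status change:", info["last_status_change"].isoformat())
```

For this message, the instance was still in INSTANCE_INITIALIZING 3368 ms past the 700 second threshold, and its last status change was at 2025-01-20 20:58:27 UTC.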
What does "UnknownReason" mean?
Do you know how we could fix this? All network connectivity is fine.