Databricks Community

drag7ter · ‎01-21-2025

From time to time we have these erors in scheduled PROD runs. It happens when job starts and tries to create one time cluster. It happens 1 time from 10-20 runs and we are not able to identify the root cause, as all network connectivity is fine, some other jobs works fine at the same time. Why does it happen? Seems some bugs in Databricks during cluster creation?

Job's one time cluster config:

job_clusters:
  - job_cluster_key: my_cluster
    new_cluster:
      cluster_name: ""
      spark_version: 15.4.x-scala2.12
      spark_conf:
        spark.databricks.cluster.profile: singleNode
        spark.master: local[*, 4]
      aws_attributes:
        first_on_demand: 1
        availability: SPOT_WITH_FALLBACK
        zone_id: eu-west-1b
        spot_bid_price_percent: 100
      node_type_id: m5d.4xlarge
      driver_node_type_id: m5d.4xlarge
      custom_tags:
        ResourceClass: SingleNode
      enable_elastic_disk: true
      data_security_mode: SINGLE_USER
      runtime_engine: STANDARD
      num_workers: 0

The error we get, impacts out PROD runs and this is really annoying:

run failed with error message
 Cluster '0120-205753-51ldqtu1' was terminated. Reason: BOOTSTRAP_TIMEOUT (SERVICE_FAULT). Parameters: databricks_error_message:[id: InstanceId(i-0a8e2c9776c79e66d), status: INSTANCE_INITIALIZING, workerEnvId:WorkerEnvId(workerenv-3386680009775160-1371cfd7-90c5-4a02-84fd-eedf9d7fa269), lastStatusChangeTime: 1737406707396, groupIdOpt Some(0),requestIdOpt Some(0120-205753-51ldqtu1-a444ac40-a32d-4c63-b),version 1] with threshold 700 seconds timed out after 703368 milliseconds. Instance bootstrap inferred timeout reason: UnknownReason. Please check network connectivity from the data plane to the control plane., instance_id:i-0a8e2c9776c79e66d.

What does it mean Unknown Reason?

Do you know how we could fix this? all network connectivity is fine.

NandiniN · ‎01-31-2025

The error message "BOOTSTRAP_TIMEOUT (SERVICE_FAULT)" indicates that the cluster was terminated because it took too long to initialize. This can happen due to various reasons, including network connectivity issues between the data plane and the control plane, or issues with the cloud provider's infrastructure.

Given the intermittent nature of the issue (1 in 10-20 runs), it might be challenging to pinpoint the exact cause. Monitoring the infrastructure and keeping track of when the errors occur can help identify any patterns or recurring issues.

I would suggest debugging it right when the issue is seen and checking all the logs and also check from the cloud provider.