Databricks Community

Sahha_Krishna · ‎07-27-2023

Unable to start the Cluster in AWS-hosted Databricks because of the below reason

{
  "reason": {
    "code": "BOOTSTRAP_TIMEOUT",
    "parameters": {
      "databricks_error_message": "[id: InstanceId(i-0634ee9c2d420edc8), status: INSTANCE_INITIALIZING, workerEnvId:WorkerEnvId(workerenv-1653812411865398-41432fd9-c426-43ee-84fd-f161341ac1db), lastStatusChangeTime: 1690489336310, groupIdOpt Some(0),requestIdOpt Some(0717-085009-xgjqxcqe-70e3f1a0-0524-43cc-b),version 1] with threshold 700 seconds timed out after 701058 milliseconds. Please check network connectivity from the data plane to the control plane.",
      "instance_id": "i-0634ee9c2d420edc8"
    }
  },
  "add_node_failure_details": {
    "failure_count": 2,
    "resource_type": "container",
    "will_retry": false
  }
}

It all started when we set up a peering connection between the Databricks default VPC and our VPC. We rolled back all the changes, but the problem persists. In AWS, I can see that the EC2 instances are initialized and running, but something else is evidently going wrong.

Any help would be greatly appreciated.

User16539034020 · ‎10-26-2023

Hi, Sahha:

Thanks for contacting Databricks Support.

This is a common type of error, which indicates that the bootstrap failed due to a misconfigured data plane network. Databricks requested EC2 instances for a new cluster but encountered a long delay while waiting for an instance to bootstrap and connect to the control plane. The cluster manager then terminated the instances and reported this error.

Please go to the AWS console and download the EC2 system log by following these instructions:

1. Open the Amazon EC2 console.
2. In the left navigation pane, choose Instances, and select the instance by its instance ID.
   The instance ID, which starts with i-xxxxxx, is shown in the Event Log section of the cluster details page. Note that the instance must have been terminated within the last hour; otherwise, it will not show up in the list. If the cluster creation failure happened a long time ago, restart the cluster to reproduce the error first.
3. Choose Actions > Monitor and troubleshoot > Get System Log.
4. Click the Download button to download the system log. It may take a few minutes for the system log to show up if the cluster was just started.
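
If you would rather pull the same log from a script than from the console, a minimal sketch along the following lines may work. It assumes boto3 is installed and that your credentials allow the EC2 GetConsoleOutput call; the helper name fetch_system_log and the region in the usage comment are placeholders of mine, not anything Databricks provides.

```python
def fetch_system_log(ec2_client, instance_id):
    """Return the console (system) log text for one EC2 instance.

    ec2_client is expected to be a boto3 EC2 client, e.g. boto3.client("ec2").
    """
    response = ec2_client.get_console_output(InstanceId=instance_id)
    # "Output" can be missing if the instance has not posted any console
    # output yet, so fall back to an empty string.
    return response.get("Output", "")


# Example usage (the region is an assumption; use the one your workspace runs in):
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-east-1")
#   log_text = fetch_system_log(ec2, "i-0634ee9c2d420edc8")
#   with open("system.log", "w") as f:
#       f.write(log_text)
```

As far as I know, boto3 already returns the output as decoded text, so no extra Base64 step should be needed for the log itself. The same one-hour caveat for terminated instances applies here as well.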

In the system log, look for messages that start with the prefix [timestamp, Bootstrap Event]. Search for FAILED_MESSAGE, and use a Base64 decode tool to decode the message. The decoded message should give the reason why bootstrap failed.
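
The search-and-decode step can also be scripted. The sketch below is only illustrative: I am assuming that the encoded payload is the first Base64-looking token after FAILED_MESSAGE on the same line, and the exact layout of the log line may differ in your log. The function name decode_failed_messages is a made-up helper.

```python
import base64
import binascii
import re

# Assumption: the payload is the first run of Base64 characters (8 or more)
# that follows the FAILED_MESSAGE marker on the same line.
B64_TOKEN = re.compile(r"[A-Za-z0-9+/]{8,}={0,2}")


def decode_failed_messages(log_text):
    """Return the decoded payload of every line that mentions FAILED_MESSAGE."""
    decoded = []
    for line in log_text.splitlines():
        marker = line.find("FAILED_MESSAGE")
        if marker == -1:
            continue
        rest = line[marker + len("FAILED_MESSAGE"):]
        match = B64_TOKEN.search(rest)
        if match is None:
            continue
        # Normalize padding so a token copied without its "=" still decodes.
        token = match.group(0).rstrip("=")
        token += "=" * (-len(token) % 4)
        try:
            raw = base64.b64decode(token, validate=True)
        except (binascii.Error, ValueError):
            continue  # not valid Base64 after all; skip this line
        decoded.append(raw.decode("utf-8", errors="replace"))
    return decoded
```

Read the downloaded log file into a string and pass it to decode_failed_messages; each returned entry is one decoded failure message. If nothing comes back, check how the marker and payload are laid out in your log and adjust the pattern.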

Regards,