
Failed to add 3 workers to the compute. Will attempt retry: true. Reason: Driver unresponsive

ambigus9
New Contributor III

I'm currently trying to create a compute cluster in a workspace with PrivateLink and a custom VPC.

I'm using Terraform: https://registry.terraform.io/providers/databricks/databricks/latest/docs/guides/aws-private-link-wo...

After the deployment completes, I try to create a compute cluster, but I get the following error:

Failed to add 3 workers to the compute. Will attempt retry: true. Reason: Driver unresponsive

Spark driver became unresponsive on startup. This issue can be caused by invalid Spark configurations or malfunctioning init scripts. Please refer to the Spark driver logs to troubleshoot this issue, and contact Databricks if the problem persists.

Internal error message: Spark failed to start: Driver unresponsive. Possible reasons: library conflicts, incorrect metastore configuration, and init script misconfiguration.

I tried everything: creating the S3 gateway endpoint, the STS interface endpoint, and the Kinesis-Streams interface endpoint:

[Image: ambigus9_0-1735912629708.png]
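
For reference, a short boto3 check can confirm those endpoints exist and are in the available state (sketch only; the VPC ID and region below are placeholders):

    # Sketch: list the VPC endpoints in the workspace VPC and their state.
    # Assumes AWS credentials are configured; replace the VPC ID and region.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-west-2")
    resp = ec2.describe_vpc_endpoints(
        Filters=[{"Name": "vpc-id", "Values": ["vpc-0123456789abcdef0"]}]
    )
    for ep in resp["VpcEndpoints"]:
        print(ep["ServiceName"], ep["VpcEndpointType"], ep["State"])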

Also, in the security group I have the corresponding ports in the inbound and outbound rules:

Security Group - Network Workspace - Inbound Rules

[Image: ambigus9_1-1735912708564.png]

Security Group - Network Workspace - Outbound Rules

[Image: ambigus9_2-1735912741139.png]

Any help will be appreciated. Thanks!

 

7 REPLIES

Alberto_Umana
Databricks Employee

Hello @ambigus9,

It seems like you are encountering issues with creating a compute cluster in a Databricks workspace configured with PrivateLink and a custom VPC using Terraform. The error message indicates that the Spark driver is becoming unresponsive on startup, which could be due to several reasons such as invalid Spark configurations, library conflicts, incorrect metastore configuration, or misconfigured init scripts.

Here are some steps you can take to troubleshoot and resolve the issue:

  1. Check Spark Configurations and Init Scripts:
    • Review the Spark configurations and ensure they are correctly set up. Invalid configurations can cause the driver to become unresponsive.
    • Verify that the init scripts are correctly configured and do not contain errors that could prevent the Spark driver from starting.
  2. Review Security Group Rules:
    • Ensure that the security group rules for both inbound and outbound traffic are correctly configured. The necessary ports (443, 2443, 6666, 8443, 8444, 8445-8451) should be open as required by Databricks; the sketch after this list shows a quick way to audit this.
    • Make sure that the security group allows traffic between the workspace subnets and the VPC endpoints.
  3. Check VPC Endpoints:
    • Verify that the VPC endpoints for the workspace and secure cluster connectivity relay are correctly set up and associated with the appropriate subnets and security groups.
    • Ensure that the DNS hostnames and DNS resolution are enabled for the VPC.
  4. Review Network ACLs:
    • Ensure that the network ACLs for the subnets allow bidirectional (outbound and inbound) rules for the necessary ports.
  5. Check AWS Service Endpoints:
    • Ensure that the necessary AWS service endpoints (S3, STS, Kinesis) are correctly set up and accessible from the workspace subnets.
  6. Review Spark Driver Logs:
    • Access the Spark driver logs to get more detailed information about the error. The logs can provide insights into what might be causing the driver to become unresponsive.
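
As a quick way to audit point 2 above, here is a minimal boto3 sketch that compares a security group's inbound and outbound rules against the required ports (the group ID and region are placeholders):

    # Sketch: report which Databricks-required ports a security group does
    # NOT cover, for both inbound and outbound rules.
    # Assumes AWS credentials are configured; replace the group ID and region.
    import boto3

    REQUIRED = {443, 2443, 6666, 8443, 8444, 8445, 8446, 8447, 8448, 8449, 8450, 8451}

    ec2 = boto3.client("ec2", region_name="us-west-2")
    sg = ec2.describe_security_groups(GroupIds=["sg-0123456789abcdef0"])["SecurityGroups"][0]

    def covered(perms):
        # Collect the required ports that fall inside any rule's port range.
        found = set()
        for p in perms:
            if p.get("IpProtocol") == "-1":  # an "all traffic" rule covers everything
                return set(REQUIRED)
            lo, hi = p.get("FromPort"), p.get("ToPort")
            if lo is not None and hi is not None:
                found.update(port for port in REQUIRED if lo <= port <= hi)
        return found

    print("inbound missing: ", sorted(REQUIRED - covered(sg["IpPermissions"])))
    print("outbound missing:", sorted(REQUIRED - covered(sg["IpPermissionsEgress"])))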

ambigus9
New Contributor III

Thanks for your quick answer! I'm curious about the security groups.

There are two security groups: one that I must create and pass to Terraform, and a second one that is created by Terraform with the description "Data Plane VPC endpoint security group".

1) Which one must have ports 443, 2443, 6666, 8443, 8444, and 8445-8451 open?

2) What should the destination be?

I have this configuration:

Data Plane VPC endpoint security group

[Image: ambigus9_0-1735913915707.png]

Security Group Created for Databricks Network (Workspace)

[Image: ambigus9_1-1735914100177.png]

Are these correctly configured?

Alberto_Umana
Databricks Employee

Hi @ambigus9,

Have you reviewed the driver logs of the cluster? That would give us a clue about the root cause of the issue.

1) The security group that must have ports 443, 2443, 6666, 8443, 8444, and 8445-8451 open is the one created by Terraform, described as the "Data Plane VPC endpoint security group".

2) The destination for these ports should be 0.0.0.0/0, which allows traffic to any destination. This is necessary for the Databricks infrastructure, cloud data sources, library repositories, secure cluster connectivity, and other internal Databricks services.

https://docs.databricks.com/en/security/network/classic/privatelink.html
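
For completeness, if those egress rules ever need to be added outside of Terraform, a boto3 sketch along these lines would open them (the group ID and region are placeholders; ideally Terraform should keep owning these rules so the configuration does not drift):

    # Sketch: open the Databricks-required egress ports to 0.0.0.0/0 on the
    # "Data Plane VPC endpoint security group".
    # Assumes AWS credentials are configured; replace the group ID and region.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-west-2")
    port_ranges = [(443, 443), (2443, 2443), (6666, 6666), (8443, 8451)]

    ec2.authorize_security_group_egress(
        GroupId="sg-0123456789abcdef0",
        IpPermissions=[
            {
                "IpProtocol": "tcp",
                "FromPort": lo,
                "ToPort": hi,
                "IpRanges": [{"CidrIp": "0.0.0.0/0"}],
            }
            for lo, hi in port_ranges
        ],
    )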

ambigus9
New Contributor III

Hi @Alberto_Umana,

Data Plane VPC endpoint Security Group - Inbound Rules

[Image: ambigus9_1-1735923037660.png]

 

Data Plane VPC endpoint Security Group - Outbound Rules

[Image: ambigus9_0-1735922985738.png]

Security Group Workspaces Network - Inbound Rules

[Image: ambigus9_2-1735923149645.png]

Security Group Workspaces Network - Outbound Rules

[Image: ambigus9_3-1735923195980.png]

1) Are these correctly configured?

2) I'm curious about the fact that the EC2 workers use the workspace network security group, as you can see in the following image:

[Image: ambigus9_4-1735923329535.png]

 

Alberto_Umana
Databricks Employee

Hi @ambigus9,

Yes, it does look fine, and I can see workerenv running. What failure do you see in the driver logs while launching a cluster?

Also, what is the status of the EC2 VM that is launched when the cluster spins up?

Alberto_Umana
Databricks Employee

Thanks for the details.

Can you ensure that the network connection to the metastore database is stable and that no firewall rules or security groups are blocking access to the database? You can use the nc command to verify connectivity to the database host and port.

You can see it here:

25/01/03 18:59:40 WARN MetastoreMonitor: Failed to connect to the metastore InternalMysqlMetastore(DbMetastoreConfig{host=mdpartyyphlhsp.caj77bnxuhme.us-west-2.rds.amazonaws.com, port=3306, dbName=organization2149045078433955, user=f7tWV573MJqOHYAs}). (timeSinceLastSuccess=0)

From a notebook, you can run the nc command against the RDS host and port shown above.
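
If nc is not available on the cluster image, a plain Python socket check from a notebook cell does the same thing (host and port taken from the log line above; adjust as needed):

    # Sketch: TCP reachability check to the metastore, equivalent to `nc -zv`.
    import socket

    host = "mdpartyyphlhsp.caj77bnxuhme.us-west-2.rds.amazonaws.com"
    port = 3306

    try:
        with socket.create_connection((host, port), timeout=10):
            print(f"OK: {host}:{port} is reachable")
    except OSError as err:
        print(f"FAILED to reach {host}:{port}: {err}")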
