01-03-2025 06:01 AM
Currently I'm trying to create a compute cluster in a workspace with PrivateLink and a custom VPC.
I'm using Terraform: https://registry.terraform.io/providers/databricks/databricks/latest/docs/guides/aws-private-link-wo...
After the deployment completes, I try to create a compute cluster but I get the following error:
Failed to add 3 workers to the compute. Will attempt retry: true. Reason: Driver unresponsive
Spark driver became unresponsive on startup. This issue can be caused by invalid Spark configurations or malfunctioning init scripts. Please refer to the Spark driver logs to troubleshoot this issue, and contact Databricks if the problem persists.
Internal error message: Spark failed to start: Driver unresponsive. Possible reasons: library conflicts, incorrect metastore configuration, and init script misconfiguration.
I tried everything: creating the S3 gateway endpoint, the STS interface endpoint, and the Kinesis-Streams interface endpoint.
The security group also has the corresponding ports open in its inbound and outbound rules:
Security Group - Network Workspace - Inbound Rules
Security Group - Network Workspace - Outbound Rules
Any help will be appreciated. Thanks!
01-03-2025 06:04 AM
Hello @ambigus9,
It seems like you are encountering issues with creating a compute cluster in a Databricks workspace configured with PrivateLink and a custom VPC using Terraform. The error message indicates that the Spark driver is becoming unresponsive on startup, which could be due to several reasons such as invalid Spark configurations, library conflicts, incorrect metastore configuration, or misconfigured init scripts.
Here are some steps you can take to troubleshoot and resolve the issue:
01-03-2025 06:25 AM
Thanks for your quick answer! I'm curious about the security groups.
There are two security groups: one that I must create myself and pass to Terraform, and a second one created by Terraform whose description is "Data Plane VPC endpoint security group".
1) Which one must have ports 443, 2443, 6666, 8443, 8444, and 8445-8451 open?
2) Which should be the Destination?
I have this configuration:
Data Plane VPC endpoint security group
Security Group Created for Databricks Network (Workspace)
Are these correctly configured?
01-03-2025 07:08 AM
Hi @ambigus9,
Have you reviewed the driver logs of the cluster? They would give us a clue about the root cause of the issue.
1) The security group that must have ports 443, 2443, 6666, 8443, 8444, and 8445-8451 open is the one created by Terraform, described as the "Data Plane VPC endpoint security group."
2) The destination for these ports should be 0.0.0.0/0, which allows traffic to any destination. This is necessary for the Databricks infrastructure, cloud data sources, library repositories, secure cluster connectivity, and other internal Databricks services:
https://docs.databricks.com/en/security/network/classic/privatelink.html
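To sanity-check the rules, a small pure-Python sketch like the one below can verify that a set of (from_port, to_port) ranges, as they appear in a security group's rule list, covers every port named above. The port list is taken from this thread; the rule set in the example is hypothetical:

```python
# Ports required between the workspace subnets and the back-end VPC
# endpoints, as listed in the question and the PrivateLink docs above.
REQUIRED_PORTS = [443, 2443, 6666, 8443, 8444] + list(range(8445, 8452))

def uncovered_ports(rules):
    """Return the required ports not covered by any (from_port, to_port) rule."""
    return [p for p in REQUIRED_PORTS
            if not any(lo <= p <= hi for lo, hi in rules)]

# Hypothetical security group rules as (from_port, to_port) pairs.
rules = [(443, 443), (2443, 2443), (8443, 8451)]
print(uncovered_ports(rules))  # [6666] -> a rule for port 6666 is missing
```

Any port printed by `uncovered_ports` needs an additional rule; an empty list means the rule set covers everything required.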
01-03-2025 08:57 AM
Hi @Alberto_Umana ,
Data Plane VPC endpoint Security Group - Inbound Rules
Data Plane VPC endpoint Security Group - Outbound Rules
Security Group Workspaces Network - Inbound Rules
Security Group Workspaces Network - Outbound Rules
1) Are these correctly configured?
2) I'm curious about the fact that the EC2 workers use the Security Group Workspaces Network, as you can see in the following image:
01-03-2025 09:42 AM
Hi @ambigus9,
Yeah, it does look fine, and I also see workerenv running. What failure do you see in the driver logs while launching a cluster?
Also, what is the status of the EC2 VM launched when the cluster is spun up?
01-03-2025 11:50 AM
Hi @Alberto_Umana ,
Here are the logs:
Standard_Output: https://gist.github.com/ambigus9/c4c17ef936a2c5fb077e26b84498b50a
Standard Error: https://gist.github.com/ambigus9/b5ef9b8ef3171189e21efd659c67d2bd
Log4j Output: https://gist.github.com/ambigus9/9911fd669d7ea914534c3a1d0cfd8dab
The status of the launched EC2 VMs looks fine:
01-03-2025 01:42 PM
Thanks for the details.
Can you ensure that the network connection to the metastore database is stable and that no firewall rules or security groups are blocking access to the database? You can use the nc command to verify connectivity to the database host and port.
You can see it here:
25/01/03 18:59:40 WARN MetastoreMonitor: Failed to connect to the metastore InternalMysqlMetastore(DbMetastoreConfig{host=mdpartyyphlhsp.caj77bnxuhme.us-west-2.rds.amazonaws.com, port=3306, dbName=organization2149045078433955, user=f7tWV573MJqOHYAs}). (timeSinceLastSuccess=0)
From a notebook, you can run the nc command against the RDS host and port above!
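If nc isn't available in the notebook environment, the same TCP check can be scripted in Python from a notebook cell. This is a minimal sketch; the metastore host and port are taken from the MetastoreMonitor warning above:

```python
import socket

def check_tcp(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers timeouts, refusals, and DNS failures
        return False

# Host and port from the metastore warning in the driver logs.
metastore_host = "mdpartyyphlhsp.caj77bnxuhme.us-west-2.rds.amazonaws.com"
print(check_tcp(metastore_host, 3306))  # False means the path is blocked
```

A False result points at the network path (security groups, NACLs, routes, or an external firewall) rather than at the cluster itself.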
01-07-2025 06:33 AM
Hi @Alberto_Umana ,
After running the command, I'm getting a connection timeout:
Curiously, the cluster shows a green status with the following logs:
Security Group Workspaces Network - Inbound Rules
Security Group Workspaces Network - Outbound Rules
01-07-2025 06:38 AM
Hi @ambigus9,
Based on the connectivity test, the connection to the RDS instance is not working. Can you check whether a firewall is blocking the request, since the connection to RDS is not going through?
01-07-2025 08:00 AM
Hi, @Alberto_Umana
I would like to share with you the VPC resources map:
I'm using app-private-datalake-subnet-a1 and app-private-datalake-subnet-b1 to deploy the workspace. Also, the subnet dedicated to the VPC endpoints is uat-datalake-vpc-0a448f9e2a1b0ef4e-pl-vpce. Is that OK?
It's important to note that this is a custom VPC and it doesn't have a NAT gateway associated; it uses a Transit Gateway instead. Here is the configuration of the subnets:
app-private-datalake-subnet-a1 - Route Table
app-private-datalake-subnet-a1 - Network ACL
app-private-datalake-subnet-b1 - Route Table
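With no NAT gateway, the workspace subnets' default route must point at the Transit Gateway, and the TGW side must then provide a path to the metastore RDS. A small sketch of that check, assuming route entries in the shape boto3's `describe_route_tables` returns (the route table contents below are hypothetical):

```python
def default_route_target(routes):
    """Return the target of the 0.0.0.0/0 route, or None if there isn't one."""
    for r in routes:
        if r.get("DestinationCidrBlock") == "0.0.0.0/0":
            # boto3 exposes the target under one of several keys
            for key in ("TransitGatewayId", "NatGatewayId", "GatewayId"):
                if key in r:
                    return r[key]
    return None

# Hypothetical route table for app-private-datalake-subnet-a1.
routes = [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
    {"DestinationCidrBlock": "0.0.0.0/0", "TransitGatewayId": "tgw-0abc123"},
]
print(default_route_target(routes))  # tgw-0abc123
```

If this returns None, or a target that cannot reach the Databricks metastore RDS, traffic to the metastore (and the nc test above) will time out even though the security groups are correct.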
And it is really frustrating that, once again, I'm getting the same error:
Security Group - datalake-sg-workspace - Inbound Rules
Security Group - Data Plane VPC endpoint security group
- Inbound Rules
Any idea what could be causing the error to appear again?