01-03-2025 06:01 AM
Currently I am trying to create a compute cluster on a workspace with PrivateLink and a custom VPC.
I'm using Terraform: https://registry.terraform.io/providers/databricks/databricks/latest/docs/guides/aws-private-link-wo...
After the deployment completes, I try to create a compute cluster, but I get the following error:
Failed to add 3 workers to the compute. Will attempt retry: true. Reason: Driver unresponsive
Spark driver became unresponsive on startup. This issue can be caused by invalid Spark configurations or malfunctioning init scripts. Please refer to the Spark driver logs to troubleshoot this issue, and contact Databricks if the problem persists.
Internal error message: Spark failed to start: Driver unresponsive. Possible reasons: library conflicts, incorrect metastore configuration, and init script misconfiguration.
I have tried everything: creating the S3 Gateway Endpoint, the STS Interface Endpoint, and the Kinesis-Streams Interface Endpoint. I have also opened the corresponding ports in the security group's inbound and outbound rules:
Security Group - Network Workspace - Inbound Rules
Security Group - Network Workspace - Outbound Rules
Any help will be appreciated. Thanks!
Accepted Solutions
01-03-2025 06:04 AM
Hello @ambigus9,
It seems like you are encountering issues with creating a compute cluster in a Databricks workspace configured with PrivateLink and a custom VPC using Terraform. The error message indicates that the Spark driver is becoming unresponsive on startup, which could be due to several reasons such as invalid Spark configurations, library conflicts, incorrect metastore configuration, or misconfigured init scripts.
Here are some steps you can take to troubleshoot and resolve the issue:
- Check Spark Configurations and Init Scripts:
- Review the Spark configurations and ensure they are correctly set up. Invalid configurations can cause the driver to become unresponsive.
- Verify that the init scripts are correctly configured and do not contain errors that could prevent the Spark driver from starting.
- Review Security Group Rules:
- Ensure that the security group rules for both inbound and outbound traffic are correctly configured. The necessary ports (443, 2443, 6666, 8443, 8444, 8445-8451) should be open as required by Databricks.
- Make sure that the security group allows traffic between the workspace subnets and the VPC endpoints.
- Check VPC Endpoints:
- Verify that the VPC endpoints for the workspace and secure cluster connectivity relay are correctly set up and associated with the appropriate subnets and security groups.
- Ensure that the DNS hostnames and DNS resolution are enabled for the VPC.
- Review Network ACLs:
- Ensure that the network ACLs for the subnets allow bidirectional (outbound and inbound) rules for the necessary ports.
- Check AWS Service Endpoints:
- Ensure that the necessary AWS service endpoints (S3, STS, Kinesis) are correctly set up and accessible from the workspace subnets (see the connectivity sketch after this list).
- Review Spark Driver Logs:
- Access the Spark driver logs to get more detailed information about the error. The logs can provide insights into what might be causing the driver to become unresponsive.
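If you want a quick way to run these connectivity checks from inside the cluster, here is a minimal sketch you could run in a notebook cell. The hostnames below are illustrative regional defaults, not values from your workspace; adjust them for your region and any PrivateLink DNS names. It only verifies that a TCP connection can be opened:

```python
# Minimal TCP connectivity probe -- run in a Databricks notebook cell.
# The hostnames are illustrative regional defaults; replace them with
# your own endpoints as needed.
import socket

ENDPOINTS = [
    ("s3.us-west-2.amazonaws.com", 443),       # S3 (gateway endpoint)
    ("sts.us-west-2.amazonaws.com", 443),      # STS (interface endpoint)
    ("kinesis.us-west-2.amazonaws.com", 443),  # Kinesis (interface endpoint)
]

def probe(host, port, timeout=5):
    """Try to open a TCP connection; report success or the error."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            print(f"{host}:{port} -> OK")
    except OSError as exc:
        print(f"{host}:{port} -> FAILED ({exc})")

for host, port in ENDPOINTS:
    probe(host, port)
```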
01-03-2025 06:25 AM
Thanks for your quick answer! I'm curious about the security groups.
There are two security groups: one that I must create and pass to Terraform, and a second that is created by Terraform with the description "Data Plane VPC endpoint security group".
1) Which one must have ports 443, 2443, 6666, 8443, 8444, and 8445-8451 opened?
2) What should the destination be?
I have this configuration:
Data Plane VPC endpoint security group
Security Group Created for Databricks Network (Workspace)
Are these correctly configured?
01-03-2025 07:08 AM
Hi @ambigus9,
Have you reviewed the cluster's driver logs? That would give us a clue about the root of the issue.
1) The security group that must have ports 443, 2443, 6666, 8443, 8444, 8445-8451 opened is the one created by Terraform, described as the "Data Plane VPC endpoint security group."
2) The destination for these ports should be 0.0.0.0/0, which allows traffic to any destination. This is necessary for the Databricks infrastructure, cloud data sources, library repositories, secure cluster connectivity, and other internal Databricks services.
https://docs.databricks.com/en/security/network/classic/privatelink.html
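If it helps, here is a small, hypothetical boto3 sketch to dump a security group's inbound and outbound rules so you can compare them against the port list above. The group ID is a placeholder; substitute the ID of the group you want to inspect (it needs ec2:DescribeSecurityGroups permissions):

```python
# Hypothetical rule dump for a security group -- the group ID below is a
# placeholder; substitute the ID of the "Data Plane VPC endpoint security
# group" created by Terraform. Requires boto3 and AWS credentials.
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")
resp = ec2.describe_security_groups(GroupIds=["sg-0123456789abcdef0"])

def show(rules, direction):
    # FromPort/ToPort are absent when the rule covers all ports.
    for rule in rules:
        ports = f'{rule.get("FromPort", "all")}-{rule.get("ToPort", "all")}'
        cidrs = [r["CidrIp"] for r in rule.get("IpRanges", [])]
        print(f"  {direction} {ports} {cidrs}")

for sg in resp["SecurityGroups"]:
    print(f'{sg["GroupId"]} ({sg.get("Description", "")})')
    show(sg["IpPermissions"], "inbound ")
    show(sg["IpPermissionsEgress"], "outbound")
```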
01-03-2025 08:57 AM
Hi @Alberto_Umana ,
Data Plane VPC endpoint Security Group - Inbound Rules
Data Plane VPC endpoint Security Group - Outbound Rules
Security Group Workspaces Network - Inbound Rules
Security Group Workspaces Network - Outbound Rules
1) Are these correctly configured?
2) I'm curious about the fact that the EC2 workers use the Security Group Workspaces Network, as you can see in the following image:
01-03-2025 09:42 AM
Hi @ambigus9,
Yes, it does look fine, and I see workerenv running. What failure do you see in the driver logs while launching a cluster?
Also, what is the status of the EC2 VM launched when the cluster is spun up?
01-03-2025 11:50 AM
Hi @Alberto_Umana ,
Here are the logs:
Standard_Output: https://gist.github.com/ambigus9/c4c17ef936a2c5fb077e26b84498b50a
Standard Error: https://gist.github.com/ambigus9/b5ef9b8ef3171189e21efd659c67d2bd
Log4j Output: https://gist.github.com/ambigus9/9911fd669d7ea914534c3a1d0cfd8dab
The status of the launched EC2 VMs looks fine:
01-03-2025 01:42 PM
Thanks for the details.
Can you ensure that the network connection to the metastore database is stable and that no firewall rules or security groups are blocking access to the database? You can use the nc command to verify connectivity to the database host and port.
You can see it here:
25/01/03 18:59:40 WARN MetastoreMonitor: Failed to connect to the metastore InternalMysqlMetastore(DbMetastoreConfig{host=mdpartyyphlhsp.caj77bnxuhme.us-west-2.rds.amazonaws.com, port=3306, dbName=organization2149045078433955, user=f7tWV573MJqOHYAs}). (timeSinceLastSuccess=0)
From a notebook, you can run the nc command against the RDS host and port shown above!
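If nc isn't available on the cluster image, a Python equivalent run from a notebook cell works the same way (host and port taken from the MetastoreMonitor warning above):

```python
# Python equivalent of `nc -zv <host> <port>`, using the metastore host
# and port from the MetastoreMonitor warning above.
import socket

HOST = "mdpartyyphlhsp.caj77bnxuhme.us-west-2.rds.amazonaws.com"
PORT = 3306

try:
    with socket.create_connection((HOST, PORT), timeout=10):
        print(f"Connected to {HOST}:{PORT} -- metastore is reachable")
except OSError as exc:
    print(f"FAILED to reach {HOST}:{PORT}: {exc}")
```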
01-07-2025 06:33 AM
Hi @Alberto_Umana ,
After running the command, I get a connection timeout:
Curiously, I get a green cluster status with the following logs:
Security Group Workspaces Network - Inbound Rules
Security Group Workspaces Network - Outbound Rules
01-07-2025 06:38 AM
Hi @ambigus9,
Based on the connectivity test, the RDS is not reachable. Can you check whether any firewall is blocking the request, since the connection is not getting through to the RDS?
01-07-2025 08:00 AM
Hi @Alberto_Umana,
I would like to share with you the VPC resources map:
I am using app-private-datalake-subnet-a1 and app-private-datalake-subnet-b1 to deploy the workspace. Also, the subnet dedicated to the VPC endpoints is uat-datalake-vpc-0a448f9e2a1b0ef4e-pl-vpce. Is that OK?
It is important to note that this is a custom VPC with no NAT Gateway associated; it uses a Transit Gateway instead. Here is the config of the subnets:
app-private-datalake-subnet-a1 - Route Table
app-private-datalake-subnet-a1 - Network ACL
app-private-datalake-subnet-b1 - Route Table
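To double-check the routing outside the console, a short boto3 sketch like this can list the routes attached to those subnets (the subnet IDs below are placeholders, not my real ones):

```python
# Hypothetical route-table listing for the workspace subnets -- subnet IDs
# are placeholders; requires boto3 and ec2:DescribeRouteTables permissions.
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")
resp = ec2.describe_route_tables(
    Filters=[{"Name": "association.subnet-id",
              "Values": ["subnet-aaaa1111", "subnet-bbbb2222"]}]
)

for table in resp["RouteTables"]:
    print(table["RouteTableId"])
    for route in table["Routes"]:
        # A route's target may be a Transit Gateway, an internet/VPC
        # gateway, or a NAT gateway, depending on the setup.
        target = (route.get("TransitGatewayId")
                  or route.get("GatewayId")
                  or route.get("NatGatewayId", "?"))
        print(" ", route.get("DestinationCidrBlock", "?"), "->", target)
```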
And it is really frustrating that once again I am getting the same error:
Security Group - datalake-sg-workspace - Inbound Rules
Security Group - Data Plane VPC endpoint security group - Inbound Rules
Any idea what could be making the error appear again?

