Friday
I'm currently trying to create a compute cluster on a workspace with PrivateLink and a custom VPC.
I'm using Terraform: https://registry.terraform.io/providers/databricks/databricks/latest/docs/guides/aws-private-link-wo...
After the deployment completes, I try to create a compute cluster, but I get the following error:
Failed to add 3 workers to the compute. Will attempt retry: true. Reason: Driver unresponsive
Spark driver became unresponsive on startup. This issue can be caused by invalid Spark configurations or malfunctioning init scripts. Please refer to the Spark driver logs to troubleshoot this issue, and contact Databricks if the problem persists.
Internal error message: Spark failed to start: Driver unresponsive. Possible reasons: library conflicts, incorrect metastore configuration, and init script misconfiguration.
I tried everything: creating the S3 Gateway Endpoint, the STS Interface Endpoint, and the Kinesis-Streams Interface Endpoint.
The Security Group also has the corresponding ports in its Inbound and Outbound rules:
Security Group - Network Workspace - Inbound Rules
Security Group - Network Workspace - Outbound Rules
Any help will be appreciated. Thanks!
Friday
Hello @ambigus9,
It seems like you are encountering issues with creating a compute cluster in a Databricks workspace configured with PrivateLink and a custom VPC using Terraform. The error message indicates that the Spark driver is becoming unresponsive on startup, which could be due to several reasons such as invalid Spark configurations, library conflicts, incorrect metastore configuration, or misconfigured init scripts.
Here are some steps you can take to troubleshoot and resolve the issue:
Friday
Thanks for your quick answer! I'm curious about the Security Groups.
There are two Security Groups: one that I must create and pass to Terraform, and a second that is created by Terraform with the description "Data Plane VPC endpoint security group".
1) Which one must have ports 443, 2443, 6666, 8443, 8444, and 8445-8451 open?
2) What should the Destination be?
I have this configuration:
Data Plane VPC endpoint security group
Security Group Created for Databricks Network (Workspace)
Are they correctly configured?
Friday
Hi @ambigus9,
Have you reviewed the driver logs of the cluster? They would give us a clue about the root cause of the issue.
1) The security group that must have ports 443, 2443, 6666, 8443, 8444, and 8445-8451 open is the one created by Terraform, which is described as the "Data Plane VPC endpoint security group."
2) The destination for these ports should be 0.0.0.0/0, which allows traffic to any destination. This is necessary for the Databricks infrastructure, cloud data sources, library repositories, secure cluster connectivity, and other internal Databricks services.
https://docs.databricks.com/en/security/network/classic/privatelink.html
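As a quick sanity check on the port list above, here is a minimal sketch that reports which of those ports are not covered by a set of TCP port ranges. The `(from_port, to_port)` tuple format and the `missing_ports` helper are assumptions made for illustration; you would fill in the ranges by hand from the rules shown in the AWS console or from your Terraform configuration.

```python
# Ports listed in this thread: 443, 2443, 6666, 8443, 8444, 8445-8451.
REQUIRED_PORTS = [443, 2443, 6666, 8443, 8444] + list(range(8445, 8452))


def missing_ports(rules, required=REQUIRED_PORTS):
    """Return the required ports that no (from_port, to_port) range covers.

    `rules` is a hypothetical, hand-written list of inclusive TCP port
    ranges copied from a security group's rules.
    """
    return [
        port
        for port in required
        if not any(low <= port <= high for low, high in rules)
    ]


# Example ranges (made up for illustration, not taken from the screenshots).
example_rules = [(443, 443), (6666, 6666), (8443, 8451)]
print(missing_ports(example_rules))
```

An empty list means every port in the list falls inside at least one range. This only checks port numbers; it does not look at protocol, source, or destination.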
Friday
Hi @Alberto_Umana ,
Data Plane VPC endpoint Security Group - Inbound Rules
Data Plane VPC endpoint Security Group - Outbound Rules
Security Group Workspaces Network - Inbound Rules
Security Group Workspaces Network - Outbound Rules
1) Are these correctly configured?
2) I'm curious about the fact that the EC2 workers use the Security Group Workspaces Network, as you can see in the following image:
Friday
Hi @ambigus9,
Yes, it looks fine, and I also see the workerenv running. What failure do you see in the driver logs when launching a cluster?
Also, what is the status of the EC2 VM launched when the cluster is spun up?
Friday
Hi @Alberto_Umana ,
Here are the logs:
Standard_Output: https://gist.github.com/ambigus9/c4c17ef936a2c5fb077e26b84498b50a
Standard Error: https://gist.github.com/ambigus9/b5ef9b8ef3171189e21efd659c67d2bd
Log4j Output: https://gist.github.com/ambigus9/9911fd669d7ea914534c3a1d0cfd8dab
The status of the launched EC2 VMs looks fine:
Friday
Thanks for the details.
Can you ensure that the network connection to the metastore database is stable and that no firewall rules or security groups are blocking access to the database? You can use the nc command to verify connectivity to the database host and port.
You can see the failed metastore connection in the driver logs:
25/01/03 18:59:40 WARN MetastoreMonitor: Failed to connect to the metastore InternalMysqlMetastore(DbMetastoreConfig{host=mdpartyyphlhsp.caj77bnxuhme.us-west-2.rds.amazonaws.com, port=3306, dbName=organization2149045078433955, user=f7tWV573MJqOHYAs}). (timeSinceLastSuccess=0)
From a notebook, you can run the nc command against the RDS host and port shown above.
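If nc is not convenient, a similar TCP reachability check can be sketched in Python with the standard `socket` module. This is a minimal sketch, not an official Databricks diagnostic: the `can_connect` helper name and the 5-second timeout are assumptions, and the host and port are copied from the log line above.

```python
import socket

# Host and port copied from the MetastoreMonitor warning in the driver log.
METASTORE_HOST = "mdpartyyphlhsp.caj77bnxuhme.us-west-2.rds.amazonaws.com"
METASTORE_PORT = 3306


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


# In a notebook cell (or any Python shell inside the VPC), run:
# print(can_connect(METASTORE_HOST, METASTORE_PORT))
```

A `False` result only says that the TCP connection could not be opened from where the code ran. It does not distinguish between a DNS failure, a blocked security group or route, and an unavailable database.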