Administration & Architecture
Databricks Cluster Failed to Start - ADD_NODES_FAILED (Solution)

chenda
New Contributor II

Lately we encountered an issue where our classic compute clusters could not start. With the help of the Databricks team, we found the root cause and got it fixed. I'm writing it up here in case it helps other people who run into the same problem in the future.

Setup

The DBx workspace was set up with VNet injection (following this doc) and had the firewall enabled on the DBx workspace storage account with the following settings:

    Workspace storage firewall:

  •        allows the DBx serverless subnets
  •        uses a Private Endpoint for classic compute

(screenshots: workspace storage account firewall and private endpoint configuration)

Problem

The serverless compute worked fine, but the classic compute failed to start most of the time. The weird parts were:

  •       a few computes could sometimes start fine (not all the time), but starting took more than 10 minutes
  •       most of the classic computes, most of the time, failed to start with the following error

(screenshot: cluster start failure with ADD_NODES_FAILED error)

The main error messages were "The data plane network is misconfigured. Please verify that the network for your data plane is configured correctly." and "...because starting the FUSE daemon timed out. This may happen because your VMs do not have outbound connectivity to DBFS storage. Also consider upgrading your cluster to a later spark version."

This indicates that the cluster VMs can't connect to the DBFS storage, which is the workspace storage account.

Troubleshooting steps

1. Checked the outbound NSG rules of the workspace private & public subnets --> confirmed nothing blocks egress, so NSG egress is not the issue

2. Troubleshot the storage private endpoint:

  • Manually create a VM in any subnet of the workspace VNet
  • RDP to the VM
  • Run "nslookup" and "telnet" against the following destinations:
    • the workspace host
    • the FQDN of the storage account's blob private endpoint (usually "[storage_name].blob.core.windows.net")
    • the FQDN of the storage account's dfs private endpoint (usually "[storage_name].dfs.core.windows.net")
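The checks above can also be scripted instead of run by hand. A minimal sketch using Python's standard library on the test VM; the workspace host and storage account name below are hypothetical placeholders, not the real resources from this workspace:

```python
# Sketch of the nslookup/telnet checks above, run from the test VM.
# All host names here are placeholders (assumptions), not real resources.
import socket
from ipaddress import ip_address

STORAGE = "mystorageacct"  # hypothetical storage account name
HOSTS = [
    "adb-1234567890123456.7.azuredatabricks.net",  # hypothetical workspace host
    f"{STORAGE}.blob.core.windows.net",            # blob private endpoint FQDN
    f"{STORAGE}.dfs.core.windows.net",             # dfs private endpoint FQDN
]

def classify(ip: str) -> str:
    """Label an address: 'private' means the private endpoint answered."""
    return "private" if ip_address(ip).is_private else "public"

def check(host: str, port: int = 443) -> None:
    try:
        ip = socket.gethostbyname(host)            # what nslookup shows
        print(f"{host} -> {ip} ({classify(ip)})")
        with socket.create_connection((host, port), timeout=5):
            print(f"  tcp/{port} reachable")       # the telnet check
    except OSError as exc:
        print(f"{host}: check failed ({exc})")

if __name__ == "__main__":
    for h in HOSTS:
        check(h)
```

If the blob/dfs FQDNs print `public` here, DNS is bypassing the private endpoint, which is exactly the symptom we found next.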

    Results of the nslookup:

  • the workspace host resolved through the private DNS zone to a private IP address ==> this is correct, so this is not an issue

(screenshot: nslookup of the workspace host resolving to a private IP)

  • the storage's blob and dfs FQDNs did NOT resolve through the private DNS zone to private IP addresses; they resolved to public IP addresses ==> so bang, this could be the issue!

(screenshot: nslookup of the blob/dfs FQDNs resolving to public IPs)

 

        Then we tried to figure out why the dfs and blob endpoints didn't resolve to private IPs.

  •         Looking at the private DNS zones for blob and dfs, we found that these private DNS zones were not linked to the VNet

(screenshot: private DNS zone with no virtual network link)

Solution

Because the private DNS zones were not linked to the VNet, the solution was to add the VNet link in each private DNS zone. After that, testing nslookup on the blob and dfs private endpoint FQDNs again, they resolved through the private DNS zone to private IPs.
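A quick way to confirm the fix is to assert that each FQDN now resolves into the private endpoint's subnet. A sketch under assumed values; the subnet range and any host names are hypothetical, not taken from this workspace:

```python
# Verify that DNS now answers with a private endpoint IP (sketch; the
# subnet below is a hypothetical assumption, not this workspace's range).
import socket
from ipaddress import ip_address, ip_network

PE_SUBNET = ip_network("10.139.1.0/24")  # hypothetical private endpoint subnet

def resolves_into(fqdn: str, subnet=PE_SUBNET) -> bool:
    """True if fqdn resolves to an address inside the given subnet."""
    try:
        return ip_address(socket.gethostbyname(fqdn)) in subnet
    except OSError:  # NXDOMAIN, timeout, etc.
        return False

# Before the VNet link, blob/dfs answered with public Azure IPs (outside the
# subnet); after the link they should land inside it:
assert ip_address("10.139.1.5") in PE_SUBNET       # private endpoint IP
assert ip_address("20.60.40.4") not in PE_SUBNET   # public storage IP
```

Running `resolves_into` against the blob and dfs FQDNs from the test VM should return True once the VNet links are in place.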

After that, all computes always start successfully 🙂 and usually within 4-5 minutes.

(screenshot: nslookup now resolving the storage FQDNs to private endpoint IPs)

One thing we could not explain is why a few computes could occasionally start before the fix, while the private DNS zones were not yet linked to the VNet.
