Administration & Architecture
Databricks Cluster Failed to Start - ADD_NODES_FAILED (Solution)

chenda
New Contributor II

Lately we encountered an issue where our classic compute clusters could not start. With the help of the Databricks team, we found the root cause and got it fixed. I'm writing it up here in case it helps other people who run into the same problem in the future.

Setup

The DBx workspace was set up with VNet injection (following this doc) and had the firewall enabled on the DBx workspace storage account with the following settings:

    Workspace storage firewall:

  •        allows the DBx serverless subnets
  •        uses a Private Endpoint for classic compute

(screenshots: workspace storage account firewall and private endpoint configuration)

Problem

The serverless compute worked fine, but the classic compute failed to start most of the time. The weird parts were:

  •       a few computes could sometimes start fine (not all the time), but starting took more than 10 minutes
  •       most of the classic computes, most of the time, failed to start with the following error

(screenshot: cluster start failure with ADD_NODES_FAILED error)

The main error messages were "The data plane network is misconfigured. Please verify that the network for your data plane is configured correctly." and "...because starting the FUSE daemon timed out. This may happen because your VMs do not have outbound connectivity to DBFS storage. Also consider upgrading your cluster to a later spark version."

This indicates that the cluster VMs can't connect to the DBFS storage, which is the workspace storage account.

Troubleshooting steps

1. Checked the outbound NSG rules of the workspace private & public subnets --> confirmed nothing blocks egress, so NSG egress is not the issue

2. Troubleshot the storage private endpoint:

  • Manually create a VM in any subnet of the workspace VNet
  • RDP to the VM
  • Run "nslookup" and "telnet" against the following destinations:
    • the workspace host
    • the FQDN of the storage account's blob private endpoint (usually "[storage_name].blob.core.windows.net")
    • the FQDN of the storage account's dfs private endpoint (usually "[storage_name].dfs.core.windows.net")
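The checks above can also be scripted instead of run by hand. A minimal sketch using Python's standard library on the test VM; the workspace host and storage account name below are hypothetical placeholders, not the real resources from this workspace:

```python
# Sketch of the nslookup/telnet checks above, run from the test VM.
# All host names here are placeholders (assumptions), not real resources.
import socket
from ipaddress import ip_address

STORAGE = "mystorageacct"  # hypothetical storage account name
HOSTS = [
    "adb-1234567890123456.7.azuredatabricks.net",  # hypothetical workspace host
    f"{STORAGE}.blob.core.windows.net",            # blob private endpoint FQDN
    f"{STORAGE}.dfs.core.windows.net",             # dfs private endpoint FQDN
]

def classify(ip: str) -> str:
    """Label an address: 'private' means the private endpoint answered."""
    return "private" if ip_address(ip).is_private else "public"

def check(host: str, port: int = 443) -> None:
    try:
        ip = socket.gethostbyname(host)            # what nslookup shows
        print(f"{host} -> {ip} ({classify(ip)})")
        with socket.create_connection((host, port), timeout=5):
            print(f"  tcp/{port} reachable")       # the telnet check
    except OSError as exc:
        print(f"{host}: check failed ({exc})")

if __name__ == "__main__":
    for h in HOSTS:
        check(h)
```

If the blob/dfs FQDNs print `public` here, DNS is bypassing the private endpoint, which is exactly the symptom we found next.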

    Results of the nslookup:

  • the workspace host resolved through the private DNS zone to a private IP address ==> this is correct, so this is not an issue

(screenshot: nslookup of the workspace host resolving to a private IP)

  • the storage's blob and dfs FQDNs did NOT resolve through the private DNS zone to private IP addresses; they resolved to public IP addresses ==> so bang, this could be the issue!

(screenshot: nslookup of the blob/dfs FQDNs resolving to public IPs)

 

        Then we tried to figure out why the dfs and blob endpoints didn't resolve to private IPs.

  •         Looking at the private DNS zones for blob and dfs, we found that these private DNS zones were not linked to the VNet

(screenshot: private DNS zone with no virtual network link)

Solution

Because the private DNS zones were not linked to the VNet, the solution was to add the VNet link in each private DNS zone. After that, testing nslookup on the blob and dfs private endpoint FQDNs again, they resolved through the private DNS zone to private IPs.
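A quick way to confirm the fix is to assert that each FQDN now resolves into the private endpoint's subnet. A sketch under assumed values; the subnet range and any host names are hypothetical, not taken from this workspace:

```python
# Verify that DNS now answers with a private endpoint IP (sketch; the
# subnet below is a hypothetical assumption, not this workspace's range).
import socket
from ipaddress import ip_address, ip_network

PE_SUBNET = ip_network("10.139.1.0/24")  # hypothetical private endpoint subnet

def resolves_into(fqdn: str, subnet=PE_SUBNET) -> bool:
    """True if fqdn resolves to an address inside the given subnet."""
    try:
        return ip_address(socket.gethostbyname(fqdn)) in subnet
    except OSError:  # NXDOMAIN, timeout, etc.
        return False

# Before the VNet link, blob/dfs answered with public Azure IPs (outside the
# subnet); after the link they should land inside it:
assert ip_address("10.139.1.5") in PE_SUBNET       # private endpoint IP
assert ip_address("20.60.40.4") not in PE_SUBNET   # public storage IP
```

Running `resolves_into` against the blob and dfs FQDNs from the test VM should return True once the VNet links are in place.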

After that, all computes always start successfully 🙂 and usually within 4-5 minutes.

(screenshot: nslookup now resolving the storage FQDNs to private endpoint IPs)

One thing we could not explain is why a few computes could occasionally start before the fix, while the private DNS zones were not yet linked to the VNet.
