Recently we ran into an issue where our classic compute clusters could not start. With help from the Databricks team, we found the root cause and fixed it, so I'm writing it up here in case it helps others who run into the same problem.
Setup
The DBx workspace was set up with VNet injection (following this doc) and had the firewall enabled on the DBx workspace storage account with the following settings:
Workspace Storage firewall:
- allows the DBx serverless subnets
- uses Private Endpoint for classic compute
Problem
The serverless compute worked fine, but the classic compute failed to start most of the time. The odd parts were:
- a few clusters occasionally started successfully (not consistently), but startup took more than 10 minutes
- most classic clusters failed to start most of the time with the following error
The main error messages were "The data plane network is misconfigured. Please verify that the network for your data plane is configured correctly." and "...because starting the FUSE daemon timed out. This may happen because your VMs do not have outbound connectivity to DBFS storage. Also consider upgrading your cluster to a later spark version."
This indicates that the cluster VMs cannot reach the DBFS storage, which is the workspace storage account.
Troubleshooting steps
1. Checked the outbound NSG rules of the workspace private & public subnets --> confirmed nothing blocked egress, so NSG egress was not the issue
2. Troubleshot the storage private endpoints:
- Manually created a VM (in any subnet) in the same VNet as the workspace storage
- RDP'd into the VM
- Ran "nslookup" and "telnet" against the following destinations:
- the workspace host
- the FQDN of the storage's blob private endpoint (usually "[storage_name].blob.core.windows.net")
- the FQDN of the storage's dfs private endpoint (usually "[storage_name].dfs.core.windows.net")
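The manual nslookup/telnet checks above can also be scripted. A minimal Python sketch (the storage account name "mystorageacct" is a placeholder — substitute your workspace storage account name):

```python
import socket

def check_endpoint(fqdn: str, port: int = 443, timeout: float = 5.0) -> dict:
    """Resolve an FQDN (like nslookup) and attempt a TCP connect (like telnet)."""
    try:
        ip = socket.gethostbyname(fqdn)  # DNS resolution
    except socket.gaierror as exc:
        return {"fqdn": fqdn, "resolved": False, "error": str(exc)}
    result = {"fqdn": fqdn, "resolved": True, "ip": ip}
    try:
        # TCP connect on 443, the port the storage endpoints listen on
        with socket.create_connection((ip, port), timeout=timeout):
            result["reachable"] = True
    except OSError:
        result["reachable"] = False
    return result

# "mystorageacct" is a placeholder for your workspace storage account name
for host in ("mystorageacct.blob.core.windows.net",
             "mystorageacct.dfs.core.windows.net"):
    print(check_endpoint(host))
```

Run it from a VM inside the workspace VNet — that is the only place the result is meaningful, since DNS resolution from outside the VNet is expected to return public IPs.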
Results of the nslookup:
- the workspace host resolved through the private DNS zone to a private IP address ==> this is correct, so not the issue
- the storage's blob and dfs FQDNs did not resolve through private DNS to private IP addresses; they resolved to public IP addresses ==> so bang, this could be the issue!
Then we tried to figure out why the dfs and blob endpoints were not resolving to private IPs.
- Looking at the private DNS zones for blob and dfs, we found that these zones were not linked to the VNet
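A quick way to turn the nslookup observation into a pass/fail check is to test whether the address an FQDN resolves to falls in a private range — with a working private endpoint and DNS zone VNet link, the storage FQDNs should resolve to private IPs from inside the VNet. A small Python sketch (the IP addresses below are illustrative, not from our environment):

```python
import ipaddress
import socket

def resolves_privately(fqdn: str) -> bool:
    """True if the FQDN resolves to a private address, as expected when the
    private endpoint and its private DNS zone VNet link are working."""
    try:
        ip = ipaddress.ip_address(socket.gethostbyname(fqdn))
    except socket.gaierror:
        return False
    return ip.is_private

# Classification on raw addresses (illustrative values, no DNS involved):
print(ipaddress.ip_address("10.139.64.4").is_private)    # True  -> looks like a private endpoint IP
print(ipaddress.ip_address("20.150.78.228").is_private)  # False -> a public Azure IP
```

In our case, `resolves_privately` run from the test VM would have returned True for the workspace host and False for the blob/dfs FQDNs — exactly the asymmetry that pointed at the missing VNet links.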
Solution
Because the private DNS zones were not linked to the VNet, the solution was to add the VNet link in each private DNS zone. After that, running nslookup against the blob and dfs private endpoint FQDNs again showed them resolving through private DNS to private IP addresses.
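The VNet link can be added in the Azure Portal, or with the Azure CLI. A sketch — the resource group, link name, and VNet resource ID below are all placeholders for your environment (the DNS zones may also live in a different resource group than the VNet):

```shell
# Placeholders: adjust RG, the link name, and the VNet resource ID
RG="my-resource-group"
VNET_ID="/subscriptions/<sub-id>/resourceGroups/$RG/providers/Microsoft.Network/virtualNetworks/my-dbx-vnet"

# Link both private DNS zones (blob and dfs) to the workspace VNet
for ZONE in privatelink.blob.core.windows.net privatelink.dfs.core.windows.net; do
  az network private-dns link vnet create \
    --resource-group "$RG" \
    --zone-name "$ZONE" \
    --name dbx-vnet-link \
    --virtual-network "$VNET_ID" \
    --registration-enabled false
done
```

`--registration-enabled false` is the usual choice here: the zone only needs to answer queries for the private endpoint records, not auto-register VM hostnames.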
Since then, all clusters have started successfully every time 🙂, usually within 4-5 minutes.
That said, we still could not explain why a few clusters managed to start before the fix, while the private DNS zones were not linked to the VNet.