2 weeks ago
Hello everyone,
Recently, I received a client request to migrate our Azure Databricks environment from a Hub-and-Spoke architecture to a vWAN Hub architecture with an NVA (Network Virtual Appliance).
Here’s a quick overview of the setup:
The Databricks workspace is VNet-injected.
Private Endpoints are configured for all required services.
Two subnets are in use: Public Host and Private Host.
The routing intent on the vWAN hub is configured to send all traffic through the NVA.
Storage accounts and DNS resolution (Private Link) work correctly — verified through a VM on the same VNet.
The issue affects only the Databricks Control Plane, which cannot communicate with the cluster/compute plane.
Error Message:
Failed to add 1 worker to the compute. Will attempt retry: true.
Reason: Control Plane Request Failure Due To Misconfig
CONTROL_PLANE_REQUEST_FAILURE:
Network health check reported that instance is unable to reach Databricks Control Plane.
Please check that instances have connectivity to the Databricks Control Plane.
Instance bootstrap inferred timeout reason: NetworkHealthCheck_CP_Failed
Failure message (Base64 encoded):
dW5yZWFjaGFibGUgY3VybDogKDI4KSBSZXNvbHZpbmcgdGltZWQgb3V0IGFmdGVyIDEwMDAwIG1pbGxpc2Vjb25kcw==
VM extension code: ProvisioningState/succeeded
InstanceId: 3fc5930e53d94adb80120a420bae2724
WorkerEnv: workerenv-85992446950252
NetworkHealthCheck finished with exit code 125.
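The Base64-encoded failure message in the error above can be decoded to see what the health check actually reported. The following is a minimal sketch using only the Python standard library; the function name `decode_failure_message` is a hypothetical helper, not part of any Databricks tooling.

```python
import base64


def decode_failure_message(encoded: str) -> str:
    """Decode the Base64 failure message from a Databricks bootstrap error."""
    # validate=True rejects characters outside the Base64 alphabet
    # instead of silently skipping them.
    raw = base64.b64decode(encoded.strip(), validate=True)
    return raw.decode("utf-8", errors="replace")


# The encoded string is copied verbatim from the error message above.
FAILURE_MESSAGE = (
    "dW5yZWFjaGFibGUgY3VybDogKDI4KSBSZXNvbHZpbmcgdGltZWQgb3V0IGFmdGVy"
    "IDEwMDAwIG1pbGxpc2Vjb25kcw=="
)

print(decode_failure_message(FAILURE_MESSAGE))
# unreachable curl: (28) Resolving timed out after 10000 milliseconds
```

The decoded text reports that name resolution timed out, which suggests a DNS problem on the cluster nodes rather than a blocked HTTPS connection.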
Troubleshooting done so far:
Verified NSG rules on both host subnets (allowing outbound 443).
Confirmed Private Endpoints are resolving correctly.
Checked that routing intent is sending outbound traffic via NVA as expected.
Validated that the same setup works in our previous Hub-and-Spoke model.
It seems that when using secured vWAN hubs with routing intent, the control plane traffic might not be reaching Databricks public endpoints.
Has anyone experienced similar issues or found a way to route control plane traffic properly through vWAN (or bypass it when needed)?
Any guidance or best practices for Databricks + vWAN + NVA setups would be appreciated.
Thanks,
Tuesday
Hello.
The issue was related to connectivity between the public/private host subnets and the DNS resolver. In the new environment, our firewall policy did not allow communication with the DNS resolver, so the DNS traffic was blocked. In the previous setup, DNS traffic bypassed the firewall and therefore did not require a specific firewall policy.
During this issue, I studied Azure Databricks architecture in depth. If anyone needs assistance or guidance on similar problems, feel free to reach out to me.
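For anyone checking a similar setup, here is a rough sketch of a connectivity test that could be run from a VM in the host subnets. It is not the exact procedure used in this thread. The resolver address `10.0.0.4` and the hostname in the usage comment are placeholders to replace with your own values. DNS normally uses UDP port 53, so the TCP probe here is only an approximate signal that the firewall allows traffic to the resolver.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


def can_resolve(hostname: str) -> bool:
    """Return True if the system resolver returns an address for hostname."""
    try:
        socket.getaddrinfo(hostname, 443)
        return True
    except socket.gaierror:
        return False


# Example usage (placeholder values, adjust to your environment):
#   can_connect("10.0.0.4", 53)            # is the DNS resolver reachable?
#   can_resolve("<workspace-hostname>")    # does name resolution work?
```

If `can_connect` fails for the resolver address while other outbound traffic works, the firewall or NVA policy for DNS (port 53, both UDP and TCP) is a reasonable first place to look.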
a week ago
The problem is fixed.
a week ago
Hi @nodeb ,
Could you share the solution with the community?
Tuesday
@nodeb Can you please mark your reply as the solution? It will help other users find the resolution faster.