Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Databricks cluster cannot reach SQL Server over VPC peering despite EC2 connectivity - AWS

learti
New Contributor III

We are experiencing a networking issue where a Databricks cluster cannot connect to a SQL Server instance hosted in another VPC, even though connectivity from a regular EC2 instance works.

Two EC2 instances deployed in the Databricks subnets (NatSubnet and ClusterSubnet) can successfully reach the database, confirming that VPC peering, routing tables, security groups, and NACLs are configured correctly. However, when a Databricks job runs on a cluster, connections to the same database time out.

I am using the default VPC created by Databricks in my AWS account, not a customer-managed VPC. Is this a limitation?

Thank you.

1 ACCEPTED SOLUTION

Hey Steve, thanks a lot for the detailed reply.

I was able to find the issue. The networking on AWS was configured correctly; the problem was that I was testing the connection from Databricks Serverless compute, which runs on Databricks' own infrastructure rather than in my workspace VPC. Once I switched to classic "Compute" (a cluster in my account's VPC), I was able to reach the database.


2 REPLIES

SteveOstrowski
Databricks Employee

Hi @learti,

This is a very common scenario -- EC2 instances in the Databricks subnets can reach your SQL Server, but the Databricks cluster itself cannot. The good news is that VPC peering is fully supported with Databricks-managed VPCs, so this is NOT a fundamental limitation. The issue is almost certainly a configuration detail in one of the areas below.

Here are the most likely causes, in order of probability:


CAUSE #1: SECURITY GROUP MISCONFIGURATION (MOST LIKELY)

When Databricks creates a managed VPC, it provisions two security groups:

- A "Managed" security group (used internally by Databricks -- do NOT modify this)
- An "Unmanaged" security group (this is the one your cluster nodes actually use for outbound traffic)

The key issue: Your SQL Server's security group needs an inbound rule that allows traffic from the Databricks "Unmanaged" security group -- not the Managed one, and not the subnet CIDR.

How to fix:
1. In the AWS VPC Dashboard, find the security groups associated with your Databricks VPC
2. Identify the security group with "Unmanaged" in the name and copy its ID
3. Go to the security group attached to your SQL Server instance
4. Add an inbound rule: Custom TCP / Port 1433 / Source = the Unmanaged security group ID

This is different from what you tested with EC2 instances. Your EC2 instances may have been using a different security group, or the SQL Server security group may allow traffic from the subnet CIDR but not from the Databricks Unmanaged security group specifically.

Docs (Steps 7-8): https://docs.databricks.com/en/security/network/classic/vpc-peering.html
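If you prefer to verify this without clicking through the console, the rule check can be sketched in a few lines of Python over the `IpPermissions` structure that `aws ec2 describe-security-groups` (or boto3's `describe_security_groups`) returns. Field names below follow the EC2 API; the security-group IDs are hypothetical placeholders:

```python
# Sketch: decide whether a security group's inbound rules allow TCP 1433
# from a given source security group. `ip_permissions` has the shape of
# the "IpPermissions" list from `aws ec2 describe-security-groups`.
def allows_sg_on_port(ip_permissions, source_sg_id, port=1433):
    for perm in ip_permissions:
        if perm.get("IpProtocol") not in ("tcp", "-1"):   # "-1" = all traffic
            continue
        # "-1" rules carry no FromPort/ToPort, so default to the full range.
        if not (perm.get("FromPort", 0) <= port <= perm.get("ToPort", 65535)):
            continue
        if any(pair.get("GroupId") == source_sg_id
               for pair in perm.get("UserIdGroupPairs", [])):
            return True
    return False

# Hypothetical shape of the rule that should exist on the SQL Server SG,
# with the Databricks Unmanaged SG as the source:
sql_server_rules = [{"IpProtocol": "tcp", "FromPort": 1433, "ToPort": 1433,
                     "UserIdGroupPairs": [{"GroupId": "sg-0unmanaged-example"}]}]
```

Run it against the SQL Server SG's actual `IpPermissions`: if it returns False for the Unmanaged SG ID, that is your missing rule.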


CAUSE #2: ROUTE TABLE ASSOCIATION MISMATCH

This is the second most common cause. The VPC peering documentation instructs you to add the peering route to the Databricks VPC's main route table. However, the subnets where Databricks cluster nodes actually run may be using custom (non-main) route tables instead of the main route table.

How to verify:
1. Go to VPC Dashboard > Subnets and find the subnets in your Databricks VPC
2. For each subnet, check the Route Table tab -- note which route table is associated
3. If any subnet uses a route table other than the main route table, you need to add the peering route to THAT route table as well
4. The route should be: Destination = SQL Server VPC CIDR / Target = peering connection ID

Your EC2 test instances may have been in a subnet that uses the main route table (where the peering route exists), while the Databricks cluster nodes land in subnets with a different route table that lacks the peering route.
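The verification steps above can also be audited programmatically. The sketch below walks the `Routes` list from `aws ec2 describe-route-tables` and checks that some route covering the SQL Server VPC CIDR targets the peering connection (field names per the EC2 API; the IDs and CIDRs are hypothetical):

```python
import ipaddress

# Sketch: given the "Routes" list from `aws ec2 describe-route-tables`,
# check whether traffic to the SQL Server VPC CIDR is sent to the peering
# connection. Run this against every route table actually associated with
# the Databricks subnets, not just the main one.
def has_peering_route(routes, peer_cidr, peering_id):
    target = ipaddress.ip_network(peer_cidr)
    for route in routes:
        dest = route.get("DestinationCidrBlock")  # absent for IPv6 routes
        if dest is None:
            continue
        # The route must cover the peer CIDR and point at the pcx-... target.
        if (target.subnet_of(ipaddress.ip_network(dest))
                and route.get("VpcPeeringConnectionId") == peering_id):
            return True
    return False

# Hypothetical route table: a local route plus the peering route.
routes = [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
    {"DestinationCidrBlock": "10.20.0.0/16", "VpcPeeringConnectionId": "pcx-0example"},
]
```

If this returns False for a route table that a cluster subnet uses, add the peering route there.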


CAUSE #3: DNS RESOLUTION NOT ENABLED ON THE PEERING CONNECTION

If you are connecting to your SQL Server by hostname (not IP), you need DNS resolution enabled on the VPC peering connection.

How to verify:
1. Go to VPC Dashboard > Peering Connections
2. Select your peering connection
3. Go to Actions > Edit DNS Settings
4. Ensure "Allow DNS resolution from the remote VPC" is enabled on both sides

You can test DNS resolution from a Databricks notebook:

%sh host <your-sql-server-hostname>

If the hostname doesn't resolve, this is your problem.

Docs (Step 4): https://docs.databricks.com/en/security/network/classic/vpc-peering.html
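The console check in step 4 can also be scripted. A sketch over the shape of one entry from `aws ec2 describe-vpc-peering-connections` (field names per the EC2 API; missing options are treated as disabled):

```python
# Sketch: given one entry from `aws ec2 describe-vpc-peering-connections`,
# report whether DNS resolution is enabled on BOTH sides of the peering.
def dns_resolution_enabled(peering):
    accepter = peering.get("AccepterVpcInfo", {}).get("PeeringOptions", {})
    requester = peering.get("RequesterVpcInfo", {}).get("PeeringOptions", {})
    return bool(accepter.get("AllowDnsResolutionFromRemoteVpc")
                and requester.get("AllowDnsResolutionFromRemoteVpc"))
```

A False here means at least one side still needs "Allow DNS resolution from the remote VPC" turned on.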


CAUSE #4: SECURE CLUSTER CONNECTIVITY (SCC) BEHAVIOR

All new Databricks workspaces use Secure Cluster Connectivity by default, meaning cluster nodes have no public IP addresses. All outbound traffic goes through a NAT gateway. Verify that the NAT subnet's route table also has the correct peering route.

Docs: https://docs.databricks.com/en/security/network/classic/secure-cluster-connectivity.html


QUICK DIAGNOSTIC STEPS

Run these from a Databricks notebook to pinpoint the issue:

# Test DNS resolution
%sh host <sql-server-hostname>

# Test TCP connectivity (port 1433 for SQL Server)
%sh nc -zv <sql-server-hostname-or-ip> 1433

# Check the route table from the node's perspective
%sh ip route

If "nc" times out but "host" resolves correctly, the problem is routing or security groups. If "host" fails, the problem is DNS.
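If "host" or "nc" are not installed on your cluster image, the same two checks can run as a single pure-Python notebook cell (stdlib only; the return strings are just labels for this sketch):

```python
import socket

# Mirrors the `host` + `nc -zv` checks: first resolve the name, then
# attempt a TCP handshake on the SQL Server port.
def probe(host, port=1433, timeout=5):
    try:
        ip = socket.gethostbyname(host)        # DNS step ("host")
    except socket.gaierror:
        return "dns-failure"                   # peering DNS options or wrong name
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return "reachable"
    except OSError:
        return "tcp-failure"                   # routing, SG, or NACL problem
```

Call it as `probe("<your-sql-server-hostname>")`: "dns-failure" points at Cause #3, "tcp-failure" at Causes #1, #2, or #4.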


SUMMARY CHECKLIST

- Security Group (SQL Server side): Inbound rule allows port 1433 from the Databricks Unmanaged SG ID
- Security Group (Databricks side): The Unmanaged SG allows outbound to the SQL Server VPC CIDR (should be allowed by default)
- Route Table (Databricks VPC): Peering route exists in the route table(s) actually associated with the cluster subnets -- not just the main route table
- Route Table (SQL Server VPC): Return route to Databricks VPC CIDR via the peering connection
- DNS Resolution: Enabled on the peering connection
- NACLs: Allow traffic in both directions on port 1433 (you mentioned these are OK)


IS THE DATABRICKS-MANAGED VPC A LIMITATION?

No, the Databricks-managed VPC supports VPC peering. The key difference is that you must use the Unmanaged security group (not the Managed one) when configuring access. If you ever need more control (PrivateLink, custom CIDR ranges, egress firewalls), you can consider a customer-managed VPC, but note that you cannot convert an existing workspace -- you would need to create a new one.

Docs: https://docs.databricks.com/en/security/network/classic/customer-managed-vpc.html


DOCUMENTATION REFERENCES

- VPC Peering with Databricks (full guide): https://docs.databricks.com/en/security/network/classic/vpc-peering.html
- Customer-Managed VPC: https://docs.databricks.com/en/security/network/classic/customer-managed-vpc.html
- Secure Cluster Connectivity: https://docs.databricks.com/en/security/network/classic/secure-cluster-connectivity.html

Hope this helps narrow it down! The security group + Unmanaged SG issue is the most common root cause for this exact symptom pattern.

* This reply was drafted with an agent system I built, which researches responses against the wide set of documentation I have available and previous memory. I personally review each draft for obvious issues, monitor the system's reliability, and update replies when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand-new features.
