KiranAnand
Databricks Employee

Introduction

Azure Databricks users often need to access on-premises resources, such as databases, that reside in their corporate networks. In most cases, the right network path, like ExpressRoute or a Site-to-Site VPN combined with a private endpoint or a load-balanced proxy, is enough to get traffic through. However, in some enterprise environments, the on-premises database sits behind a corporate firewall that blocks all inbound traffic, and the database or network team cannot open inbound ports, even for internal traffic. This is where most of the standard patterns stop working.

Previously, we published blogs on Standard Load Balancer-based connectivity and Azure Application Gateway-based connectivity for Azure Databricks Serverless. Both assume a reachable target in the customer’s Azure tenant or a reachable on-premises target over the existing network connectivity. In this post, we look at a different model: a reverse SSH tunnel proxy hub that works when inbound access to the on-premises network is not allowed.

In this blog, we walk through the architecture and present options to build a reverse-tunnel proxy hub with high availability. We will also show how this new connectivity pattern supports both Azure Databricks classic and serverless compute, without relaxing any on-premises inbound firewall restrictions. This is particularly useful for Lakeflow Connect customers ingesting from on-premises databases, where the connection must initiate from the customer’s network.

Note: We use MySQL on port 3306 as the running example throughout this post. The same pattern works for any TCP-based database — PostgreSQL on 5432, Oracle on 1521, SQL Server on 1433, and so on.

 

Why a reverse SSH tunnel?

In most enterprise networks, outbound traffic from on-premises to the cloud is permitted over ExpressRoute, a Site-to-Site VPN, or SD-WAN, while inbound traffic from the cloud into the on-premises network is restricted by the corporate firewall. A conventional connectivity model, where Azure Databricks initiates the connection to the on-premises database, requires inbound firewall rules to be configured. This may not be allowed in some environments, or may require a compliance review. A reverse SSH tunnel inverts this flow — the on-premises side initiates the connection, dialing out to a proxy virtual machine (VM) in the cloud over SSH. The proxy VM only needs to accept an outbound SSH connection, which the corporate firewall already permits.

The word “reverse” can be misleading. The SSH connection itself is a normal outbound connection from on-premises to the cloud. What is “reverse” is the direction of the application traffic that rides on top of it. When the on-premises host runs ssh -R against the proxy VM, it opens a listener on the proxy that forwards any incoming connection back over the SSH session to the on-premises database. Azure Databricks, sitting on the same VNet as the proxy VM, connects to that listener and reaches the database as if it were local to the cloud.
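To make this concrete, here is the bare manual form of that command, with placeholder addresses (the production setup later in this post wraps it in autossh and systemd):

ssh -N -R 13306:<db-ip>:3306 <user>@<proxy-vm-ip>

While this session is up, anything that connects to localhost:13306 on the proxy VM is forwarded through the SSH session to <db-ip>:3306 on the on-premises network.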

This direction aligns cleanly with the firewall policy in most enterprises. Outbound SSH (port 22) from on-premises to the cloud is almost always permitted for internal traffic. Inbound SSH is almost never permitted. The reverse tunnel takes advantage of this asymmetry and gives us a secure, authenticated path for database traffic without relaxing any firewall restrictions.

 

When to use this?

The reverse SSH tunnel pattern fills a specific gap in the connectivity options for Azure Databricks. This pattern is the right fit when:

  • Your on-premises target is reachable from Azure over an existing network path (ExpressRoute, Site-to-Site VPN, or SD-WAN), but the corporate firewall blocks inbound traffic from the cloud.

This pattern is not the right fit when:

  • Your on-premises target is reachable, and inbound access is already permitted. Simpler patterns (private endpoint, load-balanced proxy, or direct connection) cover the case more directly.
  • No network path exists between Azure and your on-premises network. Establish ExpressRoute, VPN, or SD-WAN connectivity first before considering any cloud-to-on-premises pattern.

 

Prerequisites

| Component | Notes |
|---|---|
| On-premises tunnel host | Runs autossh. Needs outbound port 22 to the proxy VMs and network access to the on-premises database. |
| Proxy hub VNet in Azure | Connectivity to on-premises via ExpressRoute, Site-to-Site VPN, or SD-WAN. |
| Proxy VMs in the proxy hub VNet | At least one; two for high availability. Run socat and the HTTP health check. |
| Standard Load Balancer | Required for high availability (multi-VM). Optional for development or single-VM setups. |
| Network path to Databricks | VNet peering (classic compute) or NCC + Private Endpoint + Private Link Service (serverless compute). |

 

Architecture

The solution consists of three zones — the on-premises network, the proxy hub VNet in Azure, and the Azure Databricks compute plane. Traffic flows from Databricks, through the proxy hub, and over the reverse SSH tunnels into the on-premises database.

[Architecture diagram: ReverseTunnel-ArchDiagram.png]

The proxy hub VNet in the middle is the core of this architecture. We will focus mostly on building out the following components:

  • Proxy VMs (two or more for high availability). Each proxy VM accepts an incoming reverse SSH connection from the on-premises tunnel host and runs two additional services:
    — socat, which bridges the VM’s network interface to the tunnel listener
    — A lightweight HTTP health check endpoint that the load balancer probes.
  • Standard Load Balancer. An internal Standard Load Balancer (SLB) sits in front of the proxy VMs. It provides a stable frontend IP, distributes traffic across the backend pool, and removes unhealthy VMs when the HTTP health probe fails. This is what gives us automatic failover when a tunnel fails.
  • Private Link Service (PLS). The PLS exposes the SLB frontend to Azure Databricks serverless compute. Serverless compute lives in a Databricks-managed VNet and cannot use VNet peering to reach customer resources, so it needs the PLS combined with a Private Endpoint (PE) rule in a Network Connectivity Configuration (NCC) on the Databricks side.

On the on-premises side, a single Linux host (physical or virtual) establishes the reverse tunnels. This tunnel host runs one autossh process per proxy VM, and each autossh process carries one or more -R port forwards that map a high port on the proxy VM to the on-premises database. The tunnel host does not need anything special beyond network access to the database and outbound SSH to the proxy VMs.

On the Azure Databricks side, classic compute reaches the SLB frontend through VNet peering, and serverless compute reaches it through the NCC Private Endpoint that targets the PLS. Both paths converge on the same SLB, so we serve both compute types with one proxy hub.

 

How to make the tunnel reachable?

By default, ssh -R binds its listener on the remote side to the loopback interface (localhost:13306). This is controlled by the GatewayPorts setting in sshd_config, which ships as no on most Linux distributions. The default is deliberate — a reverse-forwarded port is not automatically exposed to the rest of the network. In our tests, we keep this default, which means the tunnel sits safely on the proxy VM’s loopback and is not reachable directly from the NIC.

So, for the default model, we need a way to bridge traffic from the NIC to the tunnel on the loopback interface. That is where socat comes in. On each proxy VM, we run a small socat process that listens on the network interface (0.0.0.0:3306) and forwards whatever it receives to localhost:13306, where the tunnel is waiting. Traffic from Databricks flows into the NIC, through socat, into the tunnel, and out to the on-premises database.

Note: You can skip socat by setting GatewayPorts yes on the proxy VM’s SSH server and binding the -R forward to 0.0.0.0 directly. The tunnel then listens on the NIC itself. Compared to the default model, this removes the loopback boundary as an extra defense-in-depth layer — access control falls entirely on NSG rules, source IP filtering, and database authentication. We keep the sshd default and let socat handle the bridge explicitly.
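A minimal sketch of that alternative, if you prefer it (all access control then rests on NSG rules and database authentication):

# On the proxy VM, in /etc/ssh/sshd_config
GatewayPorts yes
# Restart the SSH daemon afterwards, e.g. sudo systemctl restart ssh (service name varies by distribution)

# On the tunnel host, bind the remote forward to all interfaces and skip socat
ssh -N -R 0.0.0.0:3306:<db-ip>:3306 <user>@<proxy-vm-ip>

Databricks then connects to port 3306 on the proxy VM’s NIC directly.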

 

How to make the tunnel resilient?

With two or more proxy VMs behind the SLB, we already get redundancy for many common failures. If a proxy VM crashes or becomes unreachable, a simple TCP health probe lets the SLB detect it and stop sending traffic to it, and the remaining VM keeps serving. But what if just the SSH tunnel on a VM dies, while the VM itself and the socat service on it stay healthy? In that case, a plain TCP health probe on port 3306 detects that socat is listening and marks the VM as healthy, even though no traffic actually reaches the database. Databricks connections start failing intermittently while the SLB reports the VM as healthy.

To overcome this challenge and detect that state, we run a small HTTP health check service on each proxy VM. The HTTP service listens on port 8080, and for every incoming request, it attempts a TCP connection to the tunnel port on loopback (localhost:13306). If the connection succeeds, the tunnel is alive, and we return HTTP 200. If the connection fails, the tunnel is dead, and we return HTTP 503. The SLB is configured to probe this HTTP endpoint instead of probing port 3306 directly.
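For reference, a sketch of that probe and the matching load-balancing rule with the Azure CLI (resource names are placeholders; adjust to your environment):

az network lb probe create \
  --resource-group <rg> --lb-name <proxy-hub-lb> \
  --name tunnel-health --protocol Http --port 8080 --path /

az network lb rule create \
  --resource-group <rg> --lb-name <proxy-hub-lb> \
  --name db-3306 --protocol Tcp \
  --frontend-port 3306 --backend-port 3306 \
  --frontend-ip-name <frontend-ip-config> \
  --backend-pool-name <proxy-vm-pool> \
  --probe-name tunnel-health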

With this setup, the SLB reflects the actual health of the tunnel, not just the listener in front of it. When a tunnel dies on one proxy VM, the health probe starts returning 503, the SLB removes the VM from rotation within a few seconds (about 10), and traffic flows entirely through the remaining VM. When the tunnel recovers (autossh reconnects), the probe returns 200 again, and the VM rejoins the pool automatically.

Note: SLB failover redirects new connections only. Queries already in flight on the failed proxy VM may be dropped and need to be retried by the Databricks client (JDBC driver, foreign catalog query, or Lakeflow Connect gateway). Once retried, the connection is routed to the healthy VM.

 

Building the tunnel

With the proxy hub designed, we now build the tunnel itself. This has two ends — the cloud side, where each proxy VM accepts the reverse SSH connection, and the on-premises side, where the tunnel host initiates the connection out to the cloud. We walk through both in turn.

Note: The commands and sample scripts/configs in this section assume Ubuntu. Adapt package manager commands, file paths, service names, and parameters as needed for your environment.

Cloud side — the proxy VM

Each proxy VM runs a stock Linux OS with two pieces added: the socat service that bridges the NIC to the tunnel, and a small Python-based HTTP health check that the SLB probes. The default sshd configuration stays unchanged.

1. Install socat from the package repository:

sudo apt-get install -y socat

2. A sample systemd unit for socat (runs at boot and restarts if it ever exits):

# /etc/systemd/system/socat-db-proxy.service
[Unit]
Description=socat DB proxy (NIC to SSH tunnel)
After=network.target

[Service]
ExecStart=/usr/bin/socat TCP-LISTEN:3306,bind=0.0.0.0,fork,reuseaddr TCP:localhost:13306
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

The TCP-LISTEN:3306,bind=0.0.0.0,fork,reuseaddr binds on port 3306 across all network interfaces, handles each incoming connection in a separate child process, and allows the port to be reused quickly if the service restarts. TCP:localhost:13306 is the forward target — the tunnel listener on the loopback interface.

3. Next, the health check. We implement it as a small Python HTTP server. On every incoming GET request, it attempts a TCP connection to the tunnel port on the loopback interface. If the connection succeeds, it returns HTTP 200; otherwise, 503. A sample implementation:

#!/usr/bin/env python3
# /usr/local/bin/tunnel-health-check.py
import socket
import http.server

TUNNEL_PORT = 13306

class HealthHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        try:
            s = socket.create_connection(("127.0.0.1", TUNNEL_PORT), timeout=3)
            s.close()
            self.send_response(200)
        except Exception:
            self.send_response(503)
        self.end_headers()

    def log_message(self, *args):
        pass

http.server.HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()

Make it executable:

sudo chmod +x /usr/local/bin/tunnel-health-check.py

A sample systemd unit for the health check:

# /etc/systemd/system/tunnel-health-check.service
[Unit]
Description=Tunnel health check HTTP endpoint
After=network.target

[Service]
ExecStart=/usr/bin/python3 /usr/local/bin/tunnel-health-check.py
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

The standard sshd on the proxy VM accepts the reverse SSH connection from the on-premises tunnel host. We leave GatewayPorts at its default of no, which keeps the tunnel listener safely on the loopback interface, as explained earlier. No changes to sshd_config are needed for the reverse tunnel to work.

Finally, enable and start both services:

sudo systemctl daemon-reload
sudo systemctl enable socat-db-proxy tunnel-health-check
sudo systemctl start socat-db-proxy tunnel-health-check
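A quick sanity check on the proxy VM once both services are running (the tunnel listener on 13306 only appears after the on-premises side, described next, has connected):

# socat (3306), the health check (8080), and the tunnel listener (13306) should all be listening
ss -ltn | grep -E ':3306|:8080|:13306'

# The health endpoint returns 200 while the tunnel is up and 503 when it is not
curl -i http://localhost:8080/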

Security Note: Apply NSG rules on the proxy VM to limit access as suggested below.

— Inbound SSH (port 22): Restrict to the on-premises CIDR or specific tunnel host IPs.

— Inbound database port (e.g., 3306): Restrict to the Databricks VNet CIDR (classic) and the PLS NAT subnet (serverless).

— Inbound health probe port (e.g., 8080): Restrict to the Azure load balancer probe source 168.63.129.16/32.
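As an illustration, the health probe rule expressed with the Azure CLI (names and priority are placeholders; the other rules follow the same shape):

az network nsg rule create \
  --resource-group <rg> --nsg-name <proxy-vm-nsg> \
  --name allow-azure-lb-probe --priority 120 \
  --direction Inbound --access Allow --protocol Tcp \
  --source-address-prefixes 168.63.129.16/32 \
  --destination-port-ranges 8080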

The proxy VM is now ready to accept reverse SSH tunnels and serve traffic to Databricks through the SLB.

On-premises side — the tunnel host

The on-premises tunnel host is a Linux host (physical or virtual) with network access to the database and outbound SSH (port 22) to the proxy VMs. Because the tunnel needs to stay up through network blips, proxy VM restarts, and other connection drops, we use autossh, a wrapper around ssh that automatically reconnects when the SSH session dies. On the tunnel host, we set up three pieces: autossh, an SSH key pair for authenticating to each proxy VM, and one systemd service per proxy VM so the tunnel comes up at boot and stays alive.

1. Install autossh from the package repository:

sudo apt-get install -y autossh

2. Generate an SSH key pair for the tunnel and copy the public key to each proxy VM:

ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa_tunnel -N ""
ssh-copy-id -i ~/.ssh/id_rsa_tunnel <user>@<proxy-vm-1-ip>
ssh-copy-id -i ~/.ssh/id_rsa_tunnel <user>@<proxy-vm-2-ip>

3. Wrap autossh in a systemd service so it starts at boot and restarts if it ever exits. Because each autossh process maintains exactly one outbound SSH connection, we create one systemd service per proxy VM. A sample unit for the first proxy VM:

# /etc/systemd/system/ssh-tunnel-proxy1.service
[Unit]
Description=SSH Reverse Tunnel to Proxy VM 1
After=network-online.target
Wants=network-online.target

[Service]
User=<user>
ExecStart=/usr/bin/autossh -M 0 -N \
  -R 13306:<db-ip>:3306 \
  -o ServerAliveInterval=30 \
  -o ServerAliveCountMax=3 \
  -o ExitOnForwardFailure=yes \
  -o StrictHostKeyChecking=accept-new \
  -i /home/<user>/.ssh/id_rsa_tunnel \
  <user>@<proxy-vm-1-ip>
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

Let’s walk through the flags:

  • -M 0  disables autossh’s built-in monitoring port. By itself, that means autossh would never detect a dead connection. We get that detection from SSH’s own keepalive mechanism, which is why -M 0 is always paired with ServerAliveInterval and ServerAliveCountMax.
  • ServerAliveInterval=30 — sends a keepalive every 30 seconds.
  • ServerAliveCountMax=3 — means after three missed keepalives (about 90 seconds), SSH treats the connection as dead and exits. At that point, systemd restarts autossh, which establishes a fresh tunnel.
  • ExitOnForwardFailure=yes — exits SSH immediately if the remote port forward cannot be bound, which is useful when a proxy VM is briefly unreachable at connect time.
  • -R 13306:<db-ip>:3306 — the reverse port forward itself: it opens a listener on localhost:13306 on the proxy VM and forwards everything to the on-premises database at <db-ip>:3306, which the tunnel host reaches over the local network.

Create a matching ssh-tunnel-proxy2.service for the second proxy VM (change the service name, description, and target IP). Then enable and start both:

sudo systemctl daemon-reload
sudo systemctl enable ssh-tunnel-proxy1 ssh-tunnel-proxy2
sudo systemctl start ssh-tunnel-proxy1 ssh-tunnel-proxy2
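To confirm the tunnels are up (illustrative commands):

# On the tunnel host
systemctl status ssh-tunnel-proxy1 ssh-tunnel-proxy2

# On each proxy VM, the tunnel listener should be bound to loopback only
ss -ltn | grep '127.0.0.1:13306'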

For a multi-database setup, extend each autossh command with additional -R forwards, mapping a different high tunnel port to each database:

autossh ... \
  -R 13306:<mysql-ip>:3306 \
  -R 15432:<postgres-ip>:5432 \
  -R 11433:<sqlserver-ip>:1433

Each proxy VM’s socat configuration then bridges the corresponding NIC port to each tunnel port.
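As a sketch, those bridges would look like this on each proxy VM, one per database and each wrapped in its own systemd unit like the MySQL one above (ports are illustrative and mirror the -R forwards):

/usr/bin/socat TCP-LISTEN:3306,bind=0.0.0.0,fork,reuseaddr TCP:localhost:13306
/usr/bin/socat TCP-LISTEN:5432,bind=0.0.0.0,fork,reuseaddr TCP:localhost:15432
/usr/bin/socat TCP-LISTEN:1433,bind=0.0.0.0,fork,reuseaddr TCP:localhost:11433

Each additional frontend port also needs its own SLB rule.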

Security Note: Lock down the on-premises tunnel host as well.

— Outbound port 22: Restrict the on-premises firewall to allow port 22 outbound only to the specific proxy VM IPs.

— SSH key hardening: Restrict the tunnel SSH key to forwarding only. On each proxy VM, find the entry for the tunnel public key in `~/.ssh/authorized_keys` (the one added by `ssh-copy-id` earlier) and prefix it with `restrict,permitlisten="13306"` (add a `permitlisten` for each forwarded port). This blocks shell access, PTY allocation, and agent forwarding.
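For example, assuming only the MySQL forward from earlier, the hardened entry would look roughly like this (key material truncated):

restrict,permitlisten="13306" ssh-rsa AAAAB3Nza... tunnel@onprem-host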

With both ends in place, the tunnel is live — the on-premises host maintains one autossh process per proxy VM, each proxy VM has the tunnel listener on loopback plus socat bridging to the NIC, and the SLB probes the health check to detect and route around any tunnel failures.

 

Failover testing

To validate the HA design end-to-end, we ran a set of controlled failure tests against a two-VM proxy hub. The setup mirrored the architecture described above — two proxy VMs behind an SLB, a Databricks foreign catalog for JDBC queries, and a Lakeflow Connect CDC pipeline for a long-lived connection test. The goal was to confirm that a tunnel failure is detected quickly, does not interrupt Databricks traffic, and recovers automatically. We also ran a few negative tests. As shown in the table below, the results of these tests validate the specific design choices we made, such as the HTTP health check and pointing the connection at the SLB for CDC.

| # | Test scenario | Expected result | Observed result |
|---|---|---|---|
| 1 | JDBC query through SLB, both VMs healthy | Query succeeds (baseline) | Query succeeded |
| 2 | JDBC query, one tunnel stopped, plain TCP probe (no HTTP health check) | TCP probe sees socat as healthy; SLB keeps routing to dead path; queries fail intermittently | As expected — SLB kept routing to the dead path, queries failed intermittently |
| 3 | JDBC query, one tunnel stopped, HTTP health check enabled | SLB detects and fails over; queries continue to succeed | SLB detected the failure within a few seconds, routed traffic to the remaining VM, and queries continued to succeed |
| 4 | Foreign catalog query, classic compute over VNet peering | Query succeeds (baseline) | Query succeeded |
| 5 | Foreign catalog query, serverless compute over PE → PLS → SLB | Query succeeds (baseline) | Query succeeded |
| 6 | Lakeflow CDC pipeline, connection points at a specific proxy VM, that VM's tunnel stopped | Gateway keeps retrying the dead VM; pipeline fails persistently until the connection is changed | As expected — gateway kept retrying the same dead VM, pipeline failed persistently until we repointed the connection at the SLB |
| 7 | Lakeflow CDC pipeline, connection points at SLB, one tunnel stopped | Gateway reconnects through SLB to a healthy VM; pipeline resumes without data loss | Existing gateway connection broke, the retry reconnected through SLB, and the pipeline resumed from the last binlog position with no data loss |

The most important takeaway from all these tests — point Databricks connections at the SLB frontend rather than a specific proxy VM. The SLB is what carries the failover through.
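A quick way to verify you are pointing at the right endpoint from a classic cluster, assuming VNet peering is in place and nc is available on the cluster image (<slb-frontend-ip> is the SLB frontend, not an individual proxy VM):

%sh
nc -zv <slb-frontend-ip> 3306

Use that same SLB frontend address as the host in foreign catalog connections and in the Lakeflow Connect gateway connection.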

 

Conclusion

The reverse SSH tunnel proxy hub we presented in this post is one of many patterns to connect Azure Databricks classic and serverless compute to on-premises resources. We have previously published blogs and detailed reference architectures for several such connectivity patterns. What makes this model different is the direction of connection initiation — the on-premises host initiates the SSH connection outbound (on-premises to cloud), so no inbound firewall restrictions need to be relaxed, and the database traffic from Databricks (cloud to on-premises) rides back over that established session.

If your on-premises security policy prohibits inbound connections from the cloud, and changing that policy isn't on the table, the reverse SSH tunnel proxy hub gives you a documented, validated path to reach your databases from both Azure Databricks classic and serverless compute. The architecture requires no inbound firewall changes on-premises, recovers automatically from tunnel failures, and supports multiple databases over a single hub.

For the Databricks-side configuration, refer to the official Azure Databricks documentation on Connect to on-premises databases using an SSH reverse tunnel.