Authors: Kiran Anand, Suraj Karuvel
Introduction
This guide follows up on the previously published reference architecture document, “Azure Databricks — Serverless Private Connectivity to Customer’s Resources (Part-1).” That document introduced an option to connect Databricks Serverless to resources inside a VNet in your Azure tenant. This guide builds on that solution with a sample reference architecture that extends Databricks Serverless connectivity to resources deployed on-premises.
High-Level Architecture
In the previous guide, we discussed how to connect Azure Databricks Serverless compute to resources running in the customer’s Azure tenant using a Standard Load Balancer. This provides the first mile of connectivity toward on-premises resources in our architecture. In most cases, customers will already have connectivity between their Azure tenant and their on-premises network, which provides the last-mile connectivity in our architecture.
This document presents some options to extend Databricks Serverless connectivity further to on-premises systems by configuring a forwarder to use the existing networks and thereby providing a full network path between the first-mile and last-mile connectivity in our architecture. We will continue to use the same high-level architecture diagram presented in the previous guide. For better clarity, we have updated the diagram to mark the first-mile and last-mile connectivity as well as how they are tied together using a forwarder.
Note: Customers can establish such connectivity in multiple ways, and the options presented here are only for reference. Customers are advised to choose a model that fits their deployment and security models.
Other Considerations
The solution architecture and samples presented in this document show how to establish private connectivity from the Databricks Serverless compute plane to your on-premises systems through an Azure VNet in your tenant. Test thoroughly before deploying this in higher-tier environments such as Production. A Standard Load Balancer in Azure can only have VMs or VM Scale Sets in its backend pool, so for deployments with high traffic volumes or strict performance requirements, perform the required testing and consider options to scale the backend pool.
Prerequisites
- Establish private connectivity from Azure Databricks Serverless to the Customer’s Azure tenant using a Standard Load Balancer as described in the previous part of this blog. The official Databricks documentation is available here.
- Connectivity between your Azure tenant and your on-premises network using a model recommended by Microsoft, as described in the Azure documentation.
Extend Private Connectivity
This section defines a reference architecture to extend the previously established private connectivity from Azure Databricks Serverless to the Customer’s Azure tenant using a Standard Load Balancer. Using this model, customers can privately connect to services deployed outside the load balancer’s backend pool, which could include connectivity to resources deployed on-premises or in another cloud like AWS or GCP.
In this architecture, we set up a forwarder on the backend VM of the Azure Standard Load Balancer. The forwarder relays traffic arriving from Databricks Serverless to its final destination service. We provide examples of three forwarding options that we have tested; you can choose any of them to configure the forwarder on the backend VM of the Standard Load Balancer. Customers are advised to do their due diligence before adopting any of these options for production use, including performance and scale testing per their requirements.
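To make the forwarder's role concrete, here is a minimal TCP relay sketched in Python. It is a teaching aid only, not a substitute for the production-grade options described in the following sections; the addresses in the commented example are placeholders you would replace with your own.

```python
import socket
import threading


def pipe(src, dst):
    """Copy bytes from src to dst until EOF, then close the write side."""
    try:
        while True:
            data = src.recv(4096)
            if not data:
                break
            dst.sendall(data)
    except OSError:
        pass
    finally:
        try:
            dst.shutdown(socket.SHUT_WR)
        except OSError:
            pass


def serve_forwarder(listen_host, listen_port, target_host, target_port):
    """Accept connections and relay each one to the target service."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((listen_host, listen_port))
    server.listen()
    while True:
        client, _ = server.accept()
        upstream = socket.create_connection((target_host, target_port))
        # Relay bytes in both directions concurrently.
        threading.Thread(target=pipe, args=(client, upstream), daemon=True).start()
        threading.Thread(target=pipe, args=(upstream, client), daemon=True).start()


# Hypothetical example: relay the load balancer port to an on-premises
# SQL Server (placeholder hostname and ports).
# serve_forwarder("0.0.0.0", 1434, "sqlserver1.org.example.com", 1433)
```

Conceptually, HAProxy, nginx, and Linux IP forwarding all perform this same relay, each with far more robustness and throughput than this sketch.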
HAProxy
HAProxy is a free, open source, high-performance software load balancer and proxy that distributes network traffic across multiple servers to enhance availability, scalability, and performance for TCP and HTTP-based applications. HAProxy can be effectively utilized as a reverse proxy, acting as an intermediary server that sits in front of one or more backend servers and forwards client requests to them.
You can deploy HAProxy on the backend VM of the Azure Standard Load Balancer and configure it as a forwarder. For details on installation and configuration, refer to the official HAProxy documentation. Below are a few sample HAProxy configuration snippets that forward traffic from Databricks Serverless to an on-premises server or a server in another VNet; add them to your HAProxy config file as needed.
1. The following snippet provides a sample working configuration for TCP traffic.

listen sqlserver
    bind *:1434
    mode tcp
    option tcplog
    balance roundrobin
    server sqlserver1 sqlserver1.org.example.com:1433 check
2. The following snippet provides a sample working configuration for HTTP traffic.

frontend http_frontend
    bind *:80
    default_backend webservers

backend webservers
    balance roundrobin
    server webserver1 webserver1.org.example.com:80 check
Note: You can combine both snippets above based on your requirements.
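Before testing from Databricks Serverless itself, it can help to smoke-test the TCP path from a machine that can reach the load balancer frontend. The hedged Python sketch below simply checks whether a TCP connection can be established; the address and port in the commented example are placeholders for your own load balancer frontend and HAProxy listen port.

```python
import socket


def tcp_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Placeholder values: replace with your Standard Load Balancer frontend IP
# and the port your HAProxy forwarder listens on (1434 in the sample above).
# print(tcp_reachable("10.0.0.4", 1434))
```

A successful connection only proves the network path; verifying the application protocol (for example, a SQL login) is still worthwhile.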
Nginx
Nginx is a high-performance open source software that functions as a web server, reverse proxy, load balancer, and HTTP cache, designed for speed and stability. Nginx can be used for both TCP and HTTP-based applications.
You can deploy nginx as a forwarder on the backend VM of the Azure Standard Load Balancer. Refer to the official nginx documentation for details on setup, installation, and configuration. Below are a few sample nginx configuration snippets that forward traffic from Databricks Serverless to an on-premises server or a server in another VNet; add them to your nginx config file as needed. Note that these samples use the “stream” directive, which requires the stream module to be enabled and loaded in nginx. On some distributions, the module ships as a separate package that must be installed first.
1. The following snippet provides a sample working configuration for TCP traffic.

stream {
    # --- TCP traffic (example - MySQL) ---
    server {
        listen 3306;
        proxy_pass db_backend;
    }

    upstream db_backend {
        server 10.79.0.82:3306;
    }
}
2. The following snippet provides a sample working configuration for HTTP traffic.

stream {
    # --- HTTP passthrough ---
    server {
        listen 80;
        proxy_pass http_backend;
    }

    upstream http_backend {
        server 10.79.0.82:80;
    }
}
Note: Make sure that the stream module is enabled and loaded in nginx.
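For the HTTP passthrough case, a quick end-to-end check is to issue a GET request through the forwarder and confirm the status code. The Python sketch below does only that; the URL in the commented example is a placeholder for your load balancer frontend address.

```python
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code for a GET request to url."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status


# Placeholder URL: replace with your Standard Load Balancer frontend address
# and the port nginx listens on (80 in the sample above).
# print(http_status("http://10.0.0.4:80/"))
```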
Linux IP forwarding
IP forwarding, also known as kernel IP forwarding or routing, is a Linux kernel feature that allows a Linux system to act as a router. When enabled, the system can forward IP packets between network interfaces, effectively passing traffic from one network to another. The forwarding approach described below handles TCP traffic only.
You can configure an IP forwarder using the following open source script to connect from Databricks Serverless to an on-premises server or a server in another VNet.
https://github.com/sajitsasi/az-ip-fwd/blob/main/ip_fwd.sh
The script forwards packets arriving on network interface eth0 on a given port to the destination FQDN or IP address of the target server on the desired port. Under the hood, scripts like this typically enable kernel IP forwarding (net.ipv4.ip_forward) and install iptables DNAT and MASQUERADE rules to rewrite and relay the traffic.
Note: The script provided above is only a reference. Please do the necessary testing and validation before using it in your environments.
Conclusion
In the previous part of this blog, we presented how to enable the first mile of networking from Databricks Serverless to the customer’s Azure tenant. In most cases, customers already have well-established networking between their Azure tenant and on-premises, which forms the last-mile connectivity in our architecture. As a follow-up to our previous post, this document presents a sample reference architecture that uses a forwarder to bridge the first-mile and last-mile connectivity. Here, we set up a forwarder on the backend VM of the Azure Standard Load Balancer, which forwards traffic arriving from Databricks Serverless to its final destination service. This end-to-end private connectivity architecture from Databricks Serverless to on-premises is an important step toward a seamless connectivity experience for customers, enabling them to quickly onboard workloads to Serverless.
Currently, private connectivity from Databricks Serverless to customer Azure tenants and on-premises environments is limited to the architecture outlined in this document, which utilizes Azure Standard Load Balancer. Additional connectivity options will be introduced to customers in the near future, and these will be detailed in subsequent blogs and articles as they become available.