HuamingLiu
Esteemed Contributor III

With Databricks serverless networking, our goal is to make connectivity secure and simple, with minimal configuration, so that you can focus on the data and AI use cases that matter most to you. One request we hear often is to keep a workspace locked down from unauthorized resources while still enabling secure in-cloud, and sometimes even cross-cloud, connectivity to sanctioned resources.

In this blog post, we address one of the most popular customer requests for such a cross-cloud scenario: locking down internet access from your serverless workloads on AWS while enabling access to Azure OpenAI through a dedicated, per-customer connection.

Architecture Diagram

Serverless Model Azure Private Connectivity.png

Design Walkthrough

To control access to the internet from your serverless workloads, we're enhancing egress control capabilities.  Please use this form to join our previews, or contact your account team to learn more. 

To establish a dedicated connection through the model serving endpoint to Azure OpenAI, two critical connections are required: the connection between the model serving endpoint and the customer's VPC in AWS, and the connection from the customer's VPC to Azure OpenAI.

1: Connection between the model serving endpoint and the customer VPC.

Databricks' serverless compute plane networking is managed by Network Connectivity Configuration (NCC). Each NCC container currently offers two options:

  • Stable IPs: Public IPs that provide access to your resources (public preview)
  • Private Endpoint Service: VPC endpoints that facilitate PrivateLink connections (private preview)

For maximum security, we recommend using AWS PrivateLink from serverless, which is also the approach we adopt in our example configuration.

2: Connection between the customer VPC and Azure OpenAI.

Let’s take a closer look at the architecture:

Serverless Model Private Connectivity.png

At the heart of the setup, we deploy HAProxy on EC2 instances as a Layer 4 forwarding mechanism. Requests from the model serving nodes are routed through PrivateLink to the HAProxy servers, which then forward them directly to Azure OpenAI. To make the solution enterprise-ready, we add several features:

On AWS:

  • Autoscaling for HAProxy servers: This improves fault tolerance and availability. The solution spans multiple Availability Zones, and you can tune the scaling criteria to your access patterns so the fleet always has the right amount of capacity.
  • Automated stable IP assignment and recycling: This is implemented with Auto Scaling lifecycle hooks and AWS Lambda to minimize operational overhead. Only IPs allowlisted in the Azure OpenAI firewall are used. Each scale-out event triggers a Lambda function that assigns an unassigned IP from the pool to the new instance; if no IP is available, the instance launch is aborted so that an instance without an allowlisted IP never enters service.

On Azure:

  • Azure OpenAI firewall: This restricts access to authorized IP addresses only, for greater security. We establish an Elastic IP (EIP) pool for the HAProxy servers and configure the Azure OpenAI service to permit access solely from this pool.

Constructing a VPN connection between the Amazon VPC and the Azure VNet is possible for those who wish to eliminate public access entirely. Please reach out to your account team for guidance if you would like to go this route.

Implementation Walkthrough

This section describes the detailed steps to configure a dedicated, secured connection from your workspace to the Azure OpenAI service.

Step 1 - Create and deploy an Azure OpenAI Service resource in your Azure subscription

Follow the Azure documentation to create an Azure OpenAI service and deploy a model.
CreateAOAI_3.png AOAI_DeployModel.png AOAI_Endpoint.png

⚠️NOTE: Once the Azure OpenAI service is available, please note down the deployment name and endpoint. These will be used later to configure the proxy server backend and to construct the API URL when registering your MLflow model in the Databricks model registry.
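To sanity-check those values, you can assemble the completions URL from the endpoint and deployment name and call it directly. Below is a minimal Python sketch, assuming the deployment name gpt35-demo used later in this post, an api-version of 2024-02-01 (adjust to the version your service supports), and an API key exported as the AZURE_OPENAI_KEY environment variable:

 

import os
import requests

# Placeholders: substitute your own endpoint, deployment name, and API version
endpoint = "https://dais24-aoai-demo.openai.azure.com"
deployment = "gpt35-demo"       # deployment name noted in Step 1
api_version = "2024-02-01"      # assumed; use the version your service supports

url = f"{endpoint}/openai/deployments/{deployment}/completions?api-version={api_version}"
headers = {"api-key": os.environ["AZURE_OPENAI_KEY"], "Content-Type": "application/json"}

response = requests.post(url, headers=headers, json={"prompt": "Hello", "max_tokens": 16})
print(response.status_code, response.json())

 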

Step 2 - Create all the AWS resources required for the VPC endpoint service in your AWS account

The following AWS resources are required to build the VPC endpoint service:

  • One VPC with multiple private subnets (for the Network Load Balancer (NLB)) and public subnets (for the proxy servers), and two security groups, one for the NLB and one for the Launch Template
    VPCCreation.png NLB_SG.png LaunchTemplate_SG.png
  • A pool of Elastic IPs (EIPs) that will be attached to the proxy servers and allowlisted in the Azure OpenAI firewall
    EIP.png EIP_Pool.png

⚠️NOTE: Please tag the EIPs properly, as the tag key (not the tag value) is used to identify the EIP pool in the Lambda function. In our example, only EIPs with the tag key “dais24_eip” will be assigned to the proxy servers.
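If you prefer to script this step, below is a minimal boto3 sketch that allocates one EIP and tags it with the expected tag key (the tag value and pool size are up to you; repeat for as many EIPs as your proxy fleet needs):

 

import boto3

ec2 = boto3.client("ec2")

# Allocate one EIP and tag it with the key the Lambda function filters on
allocation = ec2.allocate_address(
    Domain="vpc",
    TagSpecifications=[{
        "ResourceType": "elastic-ip",
        "Tags": [{"Key": "dais24_eip", "Value": "proxy-pool"}],  # tag value is arbitrary
    }],
)
print("Allocated:", allocation["AllocationId"], allocation["PublicIp"])

 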

  • A Launch Template that specifies the instance configuration for the proxy servers
    LaunchTemplate.png

⚠️NOTE: Optionally, you can write a shell script that installs the proxy server and put it in the User data field under the Advanced details section of the Launch Template.

LaunchTemplate_UserData.png

Below is a sample user data shell script:

 

#!/bin/bash

# Fetch the session token required by IMDSv2
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")

# Fetch the public IP address of the instance using the token
PUBLIC_IP=$(curl -H "X-aws-ec2-metadata-token: $TOKEN" -s http://169.254.169.254/latest/meta-data/public-ipv4)

# Check if the public IP address is not empty
if [ -n "$PUBLIC_IP" ]; then
  echo "Public IP found: $PUBLIC_IP"
  echo "Installing HAProxy..."

  # Update the package repository and install HAProxy
  sudo yum update -y
  sudo yum install haproxy -y

  # HAProxy is enabled and started later (see Step 3), once the backend
  # configuration in /etc/haproxy/haproxy.cfg is in place:
  # systemctl enable haproxy
  # systemctl start haproxy

  echo "HAProxy installation completed."

else
  echo "No public IP assigned to this instance. Skipping HAProxy installation."
fi

 

  • An NLB target group
    TargetGroup.png

⚠️NOTE: When creating the target group, please select TCP port 443 in the Basic configuration section and the TCP protocol in the Health checks section.
TargetGroup_HealthCheck.png
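As a scripted alternative to the console, here is a boto3 sketch of creating the target group with these settings (the group name and VPC ID are placeholders):

 

import boto3

elbv2 = boto3.client("elbv2")

# TCP on port 443 with TCP health checks, matching the console settings above
response = elbv2.create_target_group(
    Name="dais24-proxy-tg",         # placeholder name
    Protocol="TCP",
    Port=443,
    VpcId="vpc-0123456789abcdef0",  # placeholder VPC ID
    TargetType="instance",
    HealthCheckProtocol="TCP",
)
print(response["TargetGroups"][0]["TargetGroupArn"])

 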

  • An NLB
    NLB.png
  • An Auto Scaling group
    ASG.png

⚠️NOTE: Please set the initial desired capacity and minimum capacity to 0 when creating the Auto Scaling group. If these two parameters are not set to 0, the EC2 proxy servers will launch immediately but no EIPs will be assigned, because the lifecycle hook does not exist yet and so the Lambda function that assigns EIPs will not be triggered. You will need to manually update both values once the lifecycle hook is created.
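Once the lifecycle hook is in place, you can restore the capacity from the console (as shown later in this section) or with a small boto3 call; a sketch, assuming a hypothetical Auto Scaling group name:

 

import boto3

autoscaling = boto3.client("autoscaling")

# Raise capacity only after the lifecycle hook exists, so every new
# instance gets an EIP assigned before it enters service
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="dais24-proxy-asg",  # placeholder ASG name
    MinSize=2,
    DesiredCapacity=2,
)

 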

  • An IAM execution role for the Lambda function

The following IAM permissions need to be granted to the role:

 

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "autoscaling:CompleteLifecycleAction"
      ],
      "Effect": "Allow",
      "Resource": "*"
    },
    {
      "Action": [
        "ec2:AssociateAddress",
        "ec2:DisassociateAddress",
        "ec2:DescribeInstances",
        "ec2:DescribeAddresses",
        "ec2:CreateTags"
      ],
      "Effect": "Allow",
      "Resource": "*"
    },
    {
      "Action": "logs:CreateLogGroup",
      "Effect": "Allow",
      "Resource": "*"
    },
    {
      "Action": [
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Effect": "Allow",
      "Resource": "*"
    }
  ]
}

 

  • A Lambda function that assigns EIPs to the proxy servers
    Lambda.png

Below is sample Python code for the Lambda function:

 

import boto3

def lambda_handler(event, context):
    # Create a new EC2 client
    ec2_client = boto3.client('ec2')
    as_client = boto3.client('autoscaling')

    # Get the list of all available Elastic IPs from EIP pool where tag-key filter needs to match the EIP tag key
    eips = ec2_client.describe_addresses(Filters=[{'Name': 'tag-key', 'Values': ['dais24_eip']}])
    aval_eips = [eip for eip in eips['Addresses'] if 'AssociationId' not in eip]

    if not aval_eips:
        raise Exception('No free EIPs available')

    instance_id = event['detail']["EC2InstanceId"]
    eip = aval_eips[0]['AllocationId']

    # Associate the EIP with the instance
    ec2_client.associate_address(AllocationId=eip, InstanceId=instance_id)

    # Complete the lifecycle action
    response = as_client.complete_lifecycle_action(
        LifecycleHookName=event['detail']["LifecycleHookName"],
        AutoScalingGroupName=event['detail']['AutoScalingGroupName'],
        LifecycleActionToken=event['detail']['LifecycleActionToken'],
        LifecycleActionResult='CONTINUE',
        InstanceId=instance_id
    )

 

⚠️NOTE: The Lambda function's timeout should be set to 10 seconds.

  • An EventBridge rule with the Auto Scaling group lifecycle hook event as the source and the Lambda function as the target.
    EventBridgeRule.png
  • An Auto Scaling group lifecycle hook that reacts to the scale-out event.
    Go back to the Auto Scaling group -> Instance management -> Create lifecycle hook (a scripted sketch of the rule and the hook follows below):
    LifecycleHook.png
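Here is the scripted sketch referenced above. It creates the EventBridge rule for the instance-launch lifecycle event, points it at the Lambda function, and registers the lifecycle hook; the names and the Lambda ARN are placeholders, and the Lambda function additionally needs a resource-based permission allowing events.amazonaws.com to invoke it:

 

import json
import boto3

events = boto3.client("events")
autoscaling = boto3.client("autoscaling")

ASG_NAME = "dais24-proxy-asg"  # placeholder ASG name

# Rule that fires when the ASG launches an instance and the lifecycle hook pauses it
events.put_rule(
    Name="dais24-eip-assign-rule",
    EventPattern=json.dumps({
        "source": ["aws.autoscaling"],
        "detail-type": ["EC2 Instance-launch Lifecycle Action"],
        "detail": {"AutoScalingGroupName": [ASG_NAME]},
    }),
)
events.put_targets(
    Rule="dais24-eip-assign-rule",
    Targets=[{
        "Id": "eip-lambda",
        "Arn": "arn:aws:lambda:us-east-1:111122223333:function:assign-eip",  # placeholder ARN
    }],
)

# Lifecycle hook that holds a new instance until the Lambda completes the action;
# ABANDON aborts the launch if the hook times out (for example, no free EIP)
autoscaling.put_lifecycle_hook(
    AutoScalingGroupName=ASG_NAME,
    LifecycleHookName="dais24-eip-hook",
    LifecycleTransition="autoscaling:EC2_INSTANCE_LAUNCHING",
    HeartbeatTimeout=120,
    DefaultResult="ABANDON",
)

 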
  • Go back to the Auto Scaling group and change the desired capacity and minimum capacity from 0 to the actual values, such as 2.
    ASG_UpdatedCapacity.png
    Now go to the Instance management tab; two EC2 proxy servers are launched:
    ASG_InstanceManagement.png
    Wait a few minutes and verify that an EIP has been assigned to each proxy server:
    Proxy_EC2.png
Step 3 - Configure HAProxy on each proxy server

SSH into each proxy server separately and configure the Azure OpenAI endpoint as the HAProxy backend. In this example, we are using HAProxy 2.8.3; below is a sample configuration file at /etc/haproxy/haproxy.cfg:

 

global
    log /dev/log local0
    chroot /var/lib/haproxy
    pidfile /var/run/haproxy.pid
    maxconn 4000
    user haproxy
    group haproxy
    daemon
    # turn on stats unix socket
    stats socket /var/lib/haproxy/stats
    # utilize system-wide crypto-policies
    ssl-default-bind-ciphers PROFILE=SYSTEM
    ssl-default-server-ciphers PROFILE=SYSTEM

defaults
    mode http
    log global
    option httplog
    option dontlognull
    option http-server-close
    option forwardfor except 127.0.0.0/8
    option redispatch
    retries 3
    timeout http-request 10s
    timeout queue 1m
    timeout connect 10s
    timeout client 1m
    timeout server 1m
    timeout http-keep-alive 10s
    timeout check 10s
    maxconn 3000

frontend main
    bind *:443
    mode tcp
    option tcplog
    default_backend aoai

backend aoai
    mode tcp
    server azure_openai dais24-aoai-demo.openai.azure.com:443 check

 

Now you can start the HAProxy service on each EC2 instance (for example, sudo systemctl enable haproxy && sudo systemctl start haproxy); the instances should then show a healthy status on the target group page:
TargetGroup_HealthyEC2.png

Step 4 - Create a VPC endpoint service

EndPointService.png

  • When creating the VPC endpoint service, check the “Acceptance required” box under “Require acceptance for endpoint”:
    VPCE_AcceptanceRequired.png
  • After your VPC endpoint service is created, allowlist the Databricks serverless stable IAM role in the “Allow principals” tab. This allows Databricks to create a VPC endpoint that links to your VPC endpoint service.

    The Databricks serverless stable IAM role has the format
    arn:aws:iam::565502421330:role/private-connectivity-role-<region>
    For example, if your VPC endpoint service is in region us-east-1, allowlist
    arn:aws:iam::565502421330:role/private-connectivity-role-us-east-1
    Alternatively, you could allowlist *, since the network security of your VPC endpoint service is still guaranteed by manually accepting only the VPC endpoint that Databricks creates for it.
    EndpointService_AllowPrincipals.png
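The same allowlisting can be scripted; a boto3 sketch, assuming the service ID from our example and the us-east-1 stable role:

 

import boto3

ec2 = boto3.client("ec2")

# Allow the Databricks serverless stable IAM role to create endpoints
# against your VPC endpoint service
ec2.modify_vpc_endpoint_service_permissions(
    ServiceId="vpce-svc-090fa8dfc6922d838",
    AddAllowedPrincipals=[
        "arn:aws:iam::565502421330:role/private-connectivity-role-us-east-1"
    ],
)

 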

Step 5 - In the Databricks Account console, create a Network Connectivity Config (NCC) object [preview] and a Network Policy object [preview]

You can skip this step if there is an existing NCC object and a Network Policy (with restricted access) object that you wish to use for your workspace.

Please contact your account team to be enrolled in both previews. Once enrolled, you can:

Log in as a Databricks admin. In the left pane of the Account console, navigate to Cloud resources -> Network -> Network Connectivity Configurations, click “Add Network Connectivity Configuration,” enter the NCC name and region, and click “Add” to create the NCC.
NCC_Add.png

Go back to Cloud resources -> Network -> Network Policies, click “Add Network Policy” to open the Create new network policy page, enter the policy name, select “Restricted access” for Serverless Internet Access, and click the Create button to create the network policy.
NetworkPolicy.png

Step 6 - Create a private endpoint rule in the NCC

Select the NCC you created in Step 5, navigate to Private endpoint rules, click “Add private endpoint rule”, enter the Endpoint service and Domain names, and click “Add” to create the private endpoint rule in the NCC.

The Endpoint service is the service name of the VPC endpoint service that you created in Step 4. In our case, it is com.amazonaws.vpce.us-east-1.vpce-svc-090fa8dfc6922d838.

The Domain names field is the FQDN of the destination resource. In our case, it is dais24-aoai-demo.openai.azure.com, the Azure OpenAI service you created in Step 1. Please note that it does not include the “https://” prefix of the endpoint.
NCC_PrivateEndpoint1.png

Now the private endpoint rule shows PENDING status.
NCC_PrivateEndpoint2.png

Step 7 - Approve the VPC endpoint connection request on the VPC endpoint service in your AWS account

Go to the VPC endpoint service you created in Step 4, navigate to Endpoint connections, confirm the Endpoint ID matches the VPC endpoint created in Step 6, click the Actions drop-down menu, select Accept endpoint connection request, and click the Accept button in the pop-up window to approve the connection request.
VPCE_ConnectionRequest_1.png
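If you prefer to script the approval, a boto3 sketch (the endpoint ID is a placeholder; use the one shown in the Endpoint connections tab):

 

import boto3

ec2 = boto3.client("ec2")

# Accept the pending connection from the Databricks-created VPC endpoint
ec2.accept_vpc_endpoint_connections(
    ServiceId="vpce-svc-090fa8dfc6922d838",
    VpcEndpointIds=["vpce-0abc123def4567890"],  # placeholder endpoint ID
)

 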

Go back to the Private endpoint rules page in the Databricks Account console, wait a minute, and refresh the page; the private endpoint rule now shows ESTABLISHED status.
NCC_PrivateEndpoint3.png

Step 8 - Attach the NCC object and Network Policy object to your workspace in the Databricks Account console

In the Databricks Account console, navigate to “Workspaces” in the left pane, select an existing workspace, click Update workspace to open the Update workspace page, click the Network Connectivity Configuration drop-down menu, select the NCC you created in Step 5, and click the Update button to attach the NCC object to the workspace.
UpdateWorkspace.png

On the workspace Configuration tab, click the “Update network policy” button in the Network Policy box to open the “Update workspace network policy” pop-up window, select the Network Policy you created in Step 5, and click the Apply policy button to attach the Network Policy object to the workspace.
Attach_Network_Policy.png

Step 9 - Configure the Azure OpenAI firewall to allow the pool of EIPs attached to the proxy servers

On the Azure OpenAI service you created in Step 1, navigate to “Networking” in the left pane, select “Selected Networks and Private Endpoints”, enter the EIPs you created in Step 2, and click the Save button.
AOAI_Firewall.png

Step 10 - Verify that the Model Serving endpoint can access the Azure OpenAI service through the private endpoint rule in the NCC

Log in to the workspace as a workspace admin and verify that the NCC and Network Policy are applied properly:

  • Run a Python notebook on an interactive ML cluster to register a model in your workspace model registry that attempts to access the Azure OpenAI service. The following Python notebook registers a model named “dais24-aoai-model”:

 

import mlflow
import mlflow.pyfunc
import requests
import json

class TestSEGModel(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        pass

    def predict(self, _, model_input):
        first_row = model_input.iloc[0]
        api_key = "xxx"  # Please store the API key in Databricks secret and reference it from the notebook using dbutils.secrets.get
        api_url = "https://dais24-aoai-demo.openai.azure.com/openai/deployments/gpt35-demo/completions?api-version=2024..."
        prompt = first_row['prompt']
        headers = {'api-key': f'{api_key}', 'Content-Type': 'application/json'}
        json_data = {
            "prompt": prompt,
            "max_tokens": 128
        }
        try:
            response = requests.post(api_url, json=json_data, headers=headers)
        except requests.exceptions.RequestException as e:
            # Return the error details as text
            return f"Error: An error occurred - {e}"
        return [response.json()]

with mlflow.start_run(run_name='dais24-aoai-run'):
    wrappedModel = TestSEGModel()
    mlflow.pyfunc.log_model(
        artifact_path="dais24-aoai",
        python_model=wrappedModel,
        registered_model_name="dais24-aoai-model"
    )

 

  • Create a Model Serving endpoint serving the model you registered previously
  1. Go to "Machine Learning" -> "Serving" in the navigation bar on the left side of the screen
  2. Click "Create serving endpoint" button
  3. Name the serving endpoint
  4. In "Entity details", choose "Model registry model"
  5. Select Model "dais24-aoai-model" and click "Confirm" button
  6. Select "Compute type" and "Compute scale-out" and click "Create" button
  7. Wait until the serving endpoint state shows "Ready"
     ModelServingEndpoint_2.png
  • Query the endpoint and verify the Azure Open AI connectivity

Request body:

 

{
  "dataframe_records": [
    {
      "prompt": "Write 3 reasons why you should train an AI model on domain specific data sets?"
    }
  ]
}

 

Since the Azure OpenAI firewall allowlists all EIPs attached to the proxy servers, the query should succeed with the following response:
MLQuery_Good.png
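For reference, below is a minimal Python sketch of sending this request body to the serving endpoint, assuming a placeholder workspace URL and endpoint name, and a personal access token exported as DATABRICKS_TOKEN:

 

import os
import requests

workspace_url = "https://<your-workspace>.cloud.databricks.com"  # placeholder
endpoint_name = "dais24-aoai-endpoint"                           # placeholder endpoint name

response = requests.post(
    f"{workspace_url}/serving-endpoints/{endpoint_name}/invocations",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json={"dataframe_records": [
        {"prompt": "Write 3 reasons why you should train an AI model on domain specific data sets?"}
    ]},
)
print(response.status_code, response.json())

 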

Now intentionally change the Azure OpenAI firewall to the wrong IPs:
AOAI_Firewall_WrongEIP.png

Query the endpoint again; it should respond with a 403 access denied error:
MLQuery_Bad.png

Terraform Automation

We provide Terraform code to help you quickly deploy all the AWS resources described in Step 2 through Step 4. You just need to adjust the environment variables in myvars.auto.tfvars and run “terraform apply --auto-approve”.

Disclaimer

The Terraform code is provided as a sample for reference and testing purposes only. Please review and modify the code according to your needs, and fully test it before using it in your production environment.

Please keep in mind the following notes for the Terraform code:

  • For the proxy server, we use HAProxy as an example. You can choose other proxy services such as Squid or NGINX.
  • If you have existing EIPs in your AWS account, you can remove the “aws_eip” resource and manually add the tag key that is specified in the environment variable “eip_tag_key”.
  • Based on your workloads, choose the instance type and Auto Scaling group maximum size that give the best price and performance.

Summary

In this post, we presented a sample solution for establishing a secure and dedicated connection between Databricks' serverless model serving endpoint and the Azure OpenAI service. We explored the key design principles underpinning this solution and provided a Terraform template to facilitate immediate testing.