alysson_souza, Databricks Employee

Intro

In the first part of our series, we explored the architecture and some typical use cases for a new feature in Databricks that allows customers to connect their Serverless clusters to private resources in their cloud environment and on-premises using PrivateLink. This capability enables a secure and performant connection for data access while keeping the traffic within a private network.

As organisations scale, the complexity of their infrastructure grows if not planned correctly, leading to challenges in managing the numerous endpoints and services that Databricks Serverless needs to access to support a variety of use cases. For customers with a complex technology environment, the conventional approach of creating individual Network Load Balancers (NLBs) and PrivateLink Services for each resource quickly becomes hard to manage, particularly at enterprise scale, where the number of services can run into the tens or hundreds.

In part two of this series, we will examine strategies for scaling your PrivateLink-enabled connectivity while minimising operational and management overhead. We’ll discuss how to manage this complexity effectively, ensuring your architecture remains secure and scalable.

 

Challenge

To illustrate the challenges of scaling the architecture, let's consider a scenario where a customer needs to access two different APIs in their network: host1.example.com/api1 and host2.example.com/api2. Both APIs require HTTPS. In the PrivateLink model, Network Load Balancers (NLBs) operate at Layer 4 of the OSI model, meaning they can route traffic based only on IP addresses and ports. Since both APIs use HTTPS (port 443), you must create two separate NLBs - one for each API - because the NLB cannot distinguish between different HTTPS requests based on the hostname or path in the request. Additionally, two PrivateLink Services will be required on the cloud provider side, one for each NLB.

The need to replicate this process for every new API or service quickly becomes cumbersome and resource-intensive, particularly when dealing with tens or hundreds of APIs. One alternative would be to target different ports on the NLB to leverage its port-based routing capability, but this has significant drawbacks: managing a different port for each API complicates the architecture and makes troubleshooting more challenging.

This complexity underlines the need for a more manageable approach as organisations increase their usage of Databricks and the number of services they need to access expands.

The diagram below illustrates the challenge highlighted above.

Figure 1 Complexity of connecting to multiple APIs

 

Solution

Now, let's examine how we can address these challenges. To help with this, I will use two public APIs for the tests, but as highlighted above, there could be many more private APIs or services running in your cloud network or on-premises.

The APIs I will be using for illustration are:

  1. https://api.agify.io - estimates a person's age from a first name
  2. https://api.nationalize.io - predicts the likely nationality of a first name

To make both APIs available, we will configure the NLB used by our PrivateLink service to point to a Layer 7 proxy instead of targeting each private service or API directly (note that NLBs can't target a public IP address directly). For our sample architecture, we will use an EC2 VM to host the mitmproxy solution. For this blog, I chose mitmproxy because it provides a simplified configuration and great debugging capabilities that help me demonstrate how the traffic flows. However, most enterprise customers already have their preferred Layer 7 proxy solutions, which can be used for this purpose. Our new architecture is illustrated below:

Figure 2 Simplified architecture using L7 proxy

The diagram above illustrates how we can now have a single AWS PrivateLink Service and NLB while still being able to register many APIs that are accessible by your Databricks Serverless clusters.

With the PrivateLink Service created, we can now register the FQDNs and create the Endpoints using the Network Connectivity Configuration (NCC) resource in the Databricks Account. Please note that this feature is currently in Private Preview for Databricks on AWS, so you will need to work with your Databricks account team to enable it and get access to the documentation, including instructions and limitations.
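
For reference, endpoint rules like these can also be created programmatically against the Databricks Account API. The snippet below is a minimal sketch in Python: the account ID, NCC ID, token, and VPC endpoint service name are all placeholders, and because this feature is in Private Preview, the exact endpoint and field names (shown here as endpoint_service and domain_names) are assumptions that may differ from the preview documentation, which should be treated as the source of truth.

import requests

# Placeholder values for illustration only
ACCOUNT_ID = "<databricks-account-id>"
NCC_ID = "<network-connectivity-config-id>"
TOKEN = "<account-admin-token>"

# Register the PrivateLink service and the FQDNs it should serve.
# Field names are assumptions based on the preview API and may change.
resp = requests.post(
    f"https://accounts.cloud.databricks.com/api/2.0/accounts/{ACCOUNT_ID}"
    f"/network-connectivity-configs/{NCC_ID}/private-endpoint-rules",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "endpoint_service": "com.amazonaws.vpce.ap-southeast-2.vpce-svc-0123456789abcdef0",
        "domain_names": ["api.agify.io", "api.nationalize.io"],
    },
)
print(resp.status_code, resp.json())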

For the APIs in our example above, the NCC configuration will look like this:

Figure 3 Serverless Network Connectivity Configuration

Now that we have a PrivateLink service created and the API hosts registered in the NCC using that PrivateLink service, we still need to cover two main requirements before this architecture can work: 

  1. Routing via the proxy
  2. Inspecting HTTPS traffic

We will cover each one of these items in more detail below.

Important Note: This solution introduces additional components and complexity, so it should only be used when you need to enable connectivity to a large number of endpoints (roughly more than 5-10). Additionally, it’s highly recommended to use a proxy solution that you are already familiar with, to minimise the learning curve and ensure a smoother implementation.

 

Routing via the proxy

Typically, routing traffic through a proxy requires configuring the client with specific proxy settings, such as the http_proxy and https_proxy environment variables. While this approach works, it introduces additional configuration steps and potential complexity, especially when scaling to multiple clients.

However, in the pattern proposed here, we can simplify the process significantly by leveraging the network-level routing provided by the Network Load Balancer. Instead of configuring each client to use a proxy, we route all HTTPS traffic from the Databricks Serverless clusters straight through the NLB (as a passthrough) to the port the proxy is listening on. The NLB handles the routing seamlessly, directing traffic to the correct destination without needing explicit proxy configuration on the client side. This method is known as a transparent proxy.
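
How can a Layer 7 proxy know where to route each request if the client never names the proxy explicitly? For HTTPS traffic, the answer is the Server Name Indication (SNI) field, which the client sends in clear text during the TLS handshake. The minimal Python sketch below (illustrative only, not how mitmproxy is implemented) shows that a server receiving raw forwarded TCP can read the requested hostname during the handshake; proxy.crt and proxy.key are hypothetical files containing the proxy's own certificate and key:

import socket
import ssl

# The proxy presents its own certificate, signed by a CA the clients
# will be configured to trust (covered later in this post).
context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
context.load_cert_chain("proxy.crt", "proxy.key")

def log_sni(ssl_socket, server_name, ssl_context):
    # server_name is the FQDN the client asked for, e.g. "api.agify.io".
    # A real proxy would use it to pick the upstream backend.
    print(f"Client requested host: {server_name}")

context.sni_callback = log_sni

# Listen on the port the NLB targets (8080 in our example)
with socket.create_server(("0.0.0.0", 8080)) as server:
    conn, addr = server.accept()
    with context.wrap_socket(conn, server_side=True) as tls_conn:
        print(f"TLS handshake completed with {addr}")

A production proxy such as mitmproxy can use this same handshake information, along with the decrypted Host header, to route each request to the right backend.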

This approach reduces the configuration burden and ensures that all traffic is routed securely and efficiently through the proxy. The screenshot below shows the NLB configuration, which listens for HTTPS traffic (port 443) and targets the mitmproxy EC2 instance on port 8080.

Figure 4 AWS Network Load Balancer Configuration

Now that the NLB will redirect all traffic on port 443 of our PrivateLink service to the proxy, let's try calling one of the APIs we have registered in the NCC. For this, we will use the Python code below from a notebook in Databricks:

 

import requests

# URL to make the request to
url = "https://api.agify.io?name=alysson"

try:
    response = requests.get(url)

    print(f"Status Code: {response.status_code}")
    print(f"Response Content: {response.text}")

except requests.exceptions.SSLError as ssl_error:
    print(f"SSL Error: {ssl_error}")

except requests.exceptions.RequestException as req_error:
    print(f"Request Error: {req_error}")

 

When running this command in a Serverless cluster, we get the following error:

 

SSL Error: HTTPSConnectionPool(host='api.agify.io', port=443): Max retries exceeded with url: /?name=alysson (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1007)')))

 

This error happens because the proxy intercepts the TLS connection and presents a certificate signed by its own certificate authority (CA), which the client does not trust. In the next section, we will cover how to resolve this.

 

Inspecting HTTPS traffic

If your API is exposed via HTTP (not recommended), the proxy in the configuration above should already be able to retrieve the HTTP headers and route to the right endpoint. However, most APIs today use HTTPS, which provides better security by encrypting all the application-level information, including any headers included in the request. Since our proxy needs this information to route the request, and it is encrypted, what can we do?

Luckily, there is a solution for that! Most proxies can inspect HTTPS traffic by decrypting the requests before routing them to the appropriate backend. For this to work, the client making the request needs to "trust" the proxy (if anyone could decrypt the requests, SSL wouldn't add much value). To establish this trust, the client must install the proxy's certificate authority (CA) certificate, which usually involves adding it to a key store like Keychain on macOS or to specific directories in the filesystem on Linux machines. Since we don't have this level of access to the underlying hosts of the Databricks Serverless clusters, we can use Databricks Secrets or a Unity Catalog (UC) Volume to store the CA certificate and reference it when making an API request. The snippet below shows how this can be achieved in Python, with the certificate stored in a UC Volume:

 

import requests

def call_api(url):
    try:
        # Make the request with the CA certificate
        response = requests.get(url, verify="/Volumes/aso_default/default/aso_volume/mitmproxy-ca-cert.pem")
        
        # Print the response status code and content
        print(f"Status Code: {response.status_code}")
        print(f"Response Content: {response.text}")

    except requests.exceptions.SSLError as ssl_error:
        # Handle SSL errors
        print(f"SSL Error: {ssl_error}")
    except requests.exceptions.RequestException as req_error:
        # Handle other request errors
        print(f"Request Error: {req_error}")

 

Now, let's call both APIs we are using in our example. From a Serverless Notebook, I will call the function above:

Figure 5 Serverless Notebook API calls via proxy

As you can see from the screenshot above, both API calls succeeded and returned the age and likely nationality based on my name (surprisingly close to getting both answers right).

With mitmproxy, I can see the requests coming to the proxy and read all of the details that are available after decryption:

Figure 6 Summary of requests on proxy

Figure 7 Details of decrypted request

Figure 8 Decrypted response returned by API

 

Conclusion

In conclusion, the approach outlined in this blog simplifies the process of scaling connectivity from Databricks Serverless to multiple backends. By combining Layer 4 passthrough routing with a central Layer 7 proxy, we eliminate the need for complex client-side configurations and reduce the proliferation of Network Load Balancers and PrivateLink Services. This architecture not only eases management but also enhances scalability, making it more practical to handle numerous APIs and services. It’s important to note that this approach is most effective for backends where traffic can be decrypted and routed via the proxy, ensuring secure and efficient connectivity across your environment.
