Databricks Community

tom-mcmeekin · ‎09-20-2024

Looking to securely connect your private resources to Databricks Serverless? In our latest blog series, we explore how Private Link offers a seamless, secure connection between your cloud environments and Databricks' serverless offerings. Learn how this feature helps keep your data private while leveraging the full power of serverless computing. Whether you're integrating Kafka for real-time streaming or using a private artifact repository, this guide breaks down the architecture, common use cases, and best practices for implementing Private Link with ease. This is Part 1 of a multi-part series, so stay tuned for deeper insights and advanced configurations in the upcoming posts. Dive into Part 1 now and discover how to simplify your networking infrastructure with Databricks!

Overview

At Databricks, security is one of our top priorities, and with data being your most valuable asset, we strive to enable customers to embed the right security controls to protect their data. One of the most common security protection requirements customers have is the ability to retain network paths within a private network for their users to access their data and for system integration. Private Link is a service provided by cloud providers like AWS, Azure, and Google Cloud that allows you to securely connect to services hosted on the cloud without exposing the data to the public internet. Essentially, it establishes a private, secure connection between your virtual network and the services you want to access, like Databricks.

The ever-expanding serverless products from Databricks provide hassle-free compute for running notebooks, jobs, and pipelines. Since the release of our first serverless product Databricks SQL, Databricks has continued to innovate on behalf of our customers to provide best-in-class performance and simplify the management of operating a data platform through the continued release of Serverless products across the Data Intelligence Platform. The objective with Databricks serverless networking is to provide secure connectivity with minimal configuration to remove undifferentiated heavy lifting, and to allow you to focus on the data and AI use cases that matter most.

Databricks provides Private Link connectivity from your private cloud network (VPC/VNet) to both the front-end and back-end interfaces of the Databricks infrastructure. The front-end endpoint ensures that your users connect to the Databricks web application, REST APIs, and JDBC/ODBC interface over your private network and your cloud provider's network backbone. The back-end endpoints ensure that classic compute clusters in your own managed VPC/VNet connect to the secure cluster connectivity relay and REST APIs over the cloud network backbone. In the context of serverless, the front-end endpoint continues to provide private connectivity for the same services as it does for classic compute.

With the recent release of serverless, Databricks will also soon be introducing support for private connectivity from serverless compute to customers' private cloud networks, expanding the available integration capabilities. This blog is part one of a multi-part series that guides you through how to use Private Link for serverless compute to set up end-to-end private networking for the serverless Databricks Data Intelligence Platform.

High-level Architecture

Databricks serverless networking introduces a new configuration construct called Network Connectivity Configuration (NCC). An NCC is an account-level object that is used to manage private endpoint creation and firewall enablement at scale across your Data Intelligence Platform. The NCC is the configuration construct for managing your Private Link configurations from your cloud provider into Databricks Serverless.

Figure: High-level Architecture

What are the common use cases?

Databricks Private Link for serverless compute is a construct that can be used in AWS, Azure, and Google Cloud. For demonstration purposes, this blog explores some common use cases in AWS. Similar approaches can be used to establish private dedicated connectivity for similar workloads in Azure and Google Cloud.

AWS provides out-of-the-box support today for Network Load Balancer (NLB). A Network Load Balancer endpoint gives customers the most flexible interface for providing access from the Databricks Serverless compute plane to private resources within their cloud environments.

AWS NLB is a layer 4 load balancer. For each endpoint registered in the Databricks NCC, requests from Serverless compute are routed to one or more registered targets associated with the corresponding AWS NLB endpoint. When the load balancer receives a connection request, it selects a target from the target group based on the default rule. It then tries to establish a TCP connection with the chosen target on the port defined in the listener configuration. For further information on AWS NLB and target group configuration, refer to the AWS NLB documentation on Target groups for your Network Load Balancers.
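Because an NLB works at the TCP level, a plain TCP handshake is often enough to confirm that a listener and at least one of its targets are reachable. The following is a minimal sketch of such a check using only the Python standard library; the helper name `tcp_reachable` and the hostname in the usage comment are hypothetical and not part of any Databricks or AWS API.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP handshake with host:port completes within timeout."""
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Hypothetical usage from a notebook cell (the hostname is an assumption):
# tcp_reachable("kafka.example.internal", 9094)
```

A successful handshake only shows that the network path is open. It does not validate TLS, authentication, or the application protocol behind the listener.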

Before starting your implementation of Private Link for serverless compute, we recommend spending time mapping out your connectivity use cases to determine suitable integration patterns. When choosing an integration pattern, it's important to consider the interface and connectivity protocol that the service uses. For protocols like HTTP/S REST APIs, customers could benefit from reducing the number of serverless integration points by consolidating them through an API proxy (refer to part 2 of the blog series for a deep-dive on this). However, for protocols that aren't REST-based, we recommend using an NLB to front-end the service as a private endpoint.
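As a rough illustration of that mapping exercise, the sketch below groups an inventory of services by the pattern suggested above: HTTP/S services behind a shared API proxy, everything else behind its own NLB. The service names and the `group_by_pattern` helper are made up for this example, and real decisions will usually weigh more factors than the protocol alone.

```python
from enum import Enum


class Pattern(Enum):
    API_PROXY = "api_proxy"  # consolidate HTTP/S REST services behind one proxy
    NLB = "nlb"              # front the service with its own NLB endpoint


REST_PROTOCOLS = {"http", "https"}


def choose_pattern(protocol: str) -> Pattern:
    """Pick an integration pattern from the service's connectivity protocol."""
    if protocol.strip().lower() in REST_PROTOCOLS:
        return Pattern.API_PROXY
    return Pattern.NLB


def group_by_pattern(services: dict[str, str]) -> dict[Pattern, list[str]]:
    """Group service names (mapped to their protocol) by integration pattern."""
    groups: dict[Pattern, list[str]] = {p: [] for p in Pattern}
    for name, protocol in services.items():
        groups[choose_pattern(protocol)].append(name)
    return groups


# Hypothetical inventory; names and protocols are illustrative only.
inventory = {
    "internal-rest-api": "https",
    "kafka": "kafka",
    "orders-db": "postgres",
}
groups = group_by_pattern(inventory)
```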

Use case deep dive

Common use cases for security-minded customers include, but are not limited to, private connectivity to resources such as internal enterprise artifact repository managers, private API gateway endpoints, operational datastores, and event streaming platforms, or a secure and dedicated connection to Azure OpenAI. To help bring this to life, let's take a closer look at some of the use cases you could consider for establishing private connectivity from Databricks Serverless compute.

Example Use Case - Kafka integration

Connecting an event streaming platform like Kafka to Databricks Serverless compute for streaming ingestion and egress enables you to take advantage of the near real-time processing provided by Apache Spark Structured Streaming or Delta Live Tables. In this example we have selected Amazon Managed Streaming for Apache Kafka (Amazon MSK) as the Apache Kafka cluster to connect to. Similar patterns can be used for self-managed Kafka clusters on EC2 and in other clouds.

The following diagram illustrates how access to an Amazon MSK cluster, via an AWS PrivateLink connectivity pattern, with a single NLB, can be established from Databricks Serverless compute.

Figure: Example Use Case - Kafka integration

In this setup, a single dedicated NLB is used, with each MSK broker having its own listener, and each listener operating on a distinct port (e.g. 8443, 8444, 8445). Each listener is associated with a unique target group, which contains only one registered target: the IP address of the corresponding MSK broker. Because the NLB listener port differs for each broker, make sure to update the advertised.listeners property on the broker nodes in the MSK cluster so that clients are directed to the matching listener port. Additionally, one target group includes all the broker IPs as targets and has a corresponding listener on port 9094. For information on configuring the Kafka Source in your application, refer to the Spark Structured Streaming documentation.
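To make the listener layout concrete, here is a small sketch that derives the listeners for a single-NLB setup and the host:port each broker would need to advertise. The broker IPs, the endpoint DNS name, and the helper names are assumptions for illustration; only the port numbers and the `kafka.bootstrap.servers` and `subscribe` source options come from the pattern above and the Spark Kafka source.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Listener:
    port: int                 # NLB listener port
    targets: tuple[str, ...]  # broker IPs registered in the listener's target group


def plan_single_nlb(broker_ips: list[str], first_port: int = 8443,
                    bootstrap_port: int = 9094) -> list[Listener]:
    """One listener per broker on consecutive ports, plus a shared bootstrap listener."""
    listeners = [Listener(first_port + i, (ip,)) for i, ip in enumerate(broker_ips)]
    listeners.append(Listener(bootstrap_port, tuple(broker_ips)))
    return listeners


def advertised_endpoints(endpoint_dns: str, broker_ips: list[str],
                         first_port: int = 8443) -> dict[str, str]:
    """Map each broker IP to the host:port it should advertise to clients."""
    return {ip: f"{endpoint_dns}:{first_port + i}" for i, ip in enumerate(broker_ips)}


def kafka_source_options(endpoint_dns: str, topic: str,
                         bootstrap_port: int = 9094) -> dict[str, str]:
    """Options for the Spark Structured Streaming Kafka source."""
    return {
        "kafka.bootstrap.servers": f"{endpoint_dns}:{bootstrap_port}",
        "subscribe": topic,
    }


# Hypothetical values for illustration only.
brokers = ["10.0.1.10", "10.0.2.10", "10.0.3.10"]
plan = plan_single_nlb(brokers)
opts = kafka_source_options("kafka.example.internal", "events")
# In a notebook: spark.readStream.format("kafka").options(**opts).load()
```

Security-related options (for example TLS or SASL settings) are left out here and depend on how your MSK cluster is configured.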

An alternative pattern to consider is to not share one NLB across multiple MSK brokers, but instead to deploy an independent NLB for each broker. Each NLB has only one listener, and every listener uses the same port (9094) for requests to its Amazon MSK broker. This alternative pattern has the advantage of not having to modify the advertised listener configuration in the MSK cluster, and it potentially reduces the need to reconfigure Apache Kafka clients. Whilst this simplifies the Kafka configuration layer, you should also consider the additional cost of deploying more NLBs (one for each broker) and the operational overhead that comes with them.
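The trade-off between the two patterns can be summarised with some simple counting. The sketch below is only a back-of-the-envelope comparison based on the descriptions above; the function name and dictionary keys are our own, and it deliberately ignores pricing, which varies by region and traffic.

```python
def compare_patterns(num_brokers: int) -> dict[str, dict[str, object]]:
    """Count NLBs and listeners needed by each pattern for a given broker count."""
    if num_brokers < 1:
        raise ValueError("num_brokers must be at least 1")
    return {
        # One shared NLB: a listener per broker plus the shared listener on 9094.
        "single_nlb": {
            "nlbs": 1,
            "listeners": num_brokers + 1,
            "change_advertised_listeners": True,
        },
        # One NLB per broker: a single listener on 9094 for each NLB.
        "nlb_per_broker": {
            "nlbs": num_brokers,
            "listeners": num_brokers,
            "change_advertised_listeners": False,
        },
    }


comparison = compare_patterns(3)
```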

Example Use Case - Artifact repository manager

Using a private PyPI repository for custom libraries in Databricks is essential for robust versioning and streamlined library management. While Databricks makes it easy to install public libraries, managing custom libraries can be more challenging, especially when version control is necessary to prevent disruptions in production environments. Traditional approaches like uploading Python Wheels to Blob Storage or installing directly from Git repositories lack proper version tracking and require cumbersome manual maintenance. By leveraging a private PyPI repository, such as JFrog Artifactory, organizations can ensure consistent versioning, simplify library updates, and maintain control over their proprietary code, all within a secure environment.

The following diagram illustrates how access to a JFrog Artifactory environment hosted on Amazon EC2, via an AWS PrivateLink connectivity pattern, with a single NLB, can be established from Databricks Serverless compute.

Figure: Example Use Case - Artifact repository manager

The JFrog Artifactory deployment pattern is an extension of the available AWS Quickstart reference architecture, with the addition of Databricks serverless compute access via AWS PrivateLink. Using the FQDN configured in the NCC private link rule, Databricks users can specify the index-url of the private PyPI repository to connect to your private artifact repository. The JFrog Artifactory deployment configures an AWS NLB and target group listeners on TCP ports 8081 and 8082.
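As a sketch of what specifying the index-url can look like, the snippet below builds an Artifactory-style PyPI index URL and a requirements file that points pip at it. The FQDN, repository name, package pin, scheme, and the choice of port 8082 are all assumptions for illustration; check your own Artifactory deployment for the actual values and for how credentials should be supplied.

```python
def artifactory_index_url(fqdn: str, repo: str, port: int = 8082,
                          scheme: str = "http") -> str:
    """Build a PyPI 'simple' index URL for an Artifactory repository.

    Assumes the common /artifactory/api/pypi/<repo>/simple path layout.
    """
    return f"{scheme}://{fqdn}:{port}/artifactory/api/pypi/{repo}/simple"


def requirements_text(index_url: str, packages: list[str]) -> str:
    """Render requirements.txt content that installs from the private index."""
    return "\n".join([f"--index-url {index_url}", *packages]) + "\n"


# Hypothetical FQDN, repository, and package pin.
url = artifactory_index_url("artifactory.example.internal", "pypi-local")
reqs = requirements_text(url, ["my-internal-lib==1.2.0"])
# Equivalent one-off install in a notebook cell:
#   %pip install --index-url <url> my-internal-lib==1.2.0
```

Pinning versions in the requirements file is what gives you the consistent versioning described earlier.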

In addition to integrating Databricks serverless compute with private artifact repositories, Databricks Serverless Notebooks introduce a new way of managing Python dependencies via the Environment side panel. This panel provides a single place to edit, view, and export a notebook’s library requirements for Databricks Serverless compute.

Considerations

Limits

NCCs are account-level regional constructs used to manage private endpoint creation and firewall enablement at scale. Each NCC can be attached to up to 50 workspaces, and each Databricks account can have up to 10 NCCs per supported region. Please work with your Databricks account teams if you have any questions regarding the mentioned limits.
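These limits imply a ceiling of 10 × 50 = 500 workspaces per region that can be attached to NCCs. The sketch below turns that arithmetic into a small planning helper. The constants are taken from the limits stated above and may change, and the function name is our own.

```python
import math

MAX_WORKSPACES_PER_NCC = 50  # limit stated above; may change over time
MAX_NCCS_PER_REGION = 10     # limit stated above; may change over time


def nccs_needed(workspaces_in_region: int) -> int:
    """Minimum number of NCCs needed to attach every workspace in one region."""
    if workspaces_in_region < 0:
        raise ValueError("workspaces_in_region must not be negative")
    needed = math.ceil(workspaces_in_region / MAX_WORKSPACES_PER_NCC)
    if needed > MAX_NCCS_PER_REGION:
        raise ValueError(
            f"{workspaces_in_region} workspaces exceed the regional capacity of "
            f"{MAX_WORKSPACES_PER_NCC * MAX_NCCS_PER_REGION}"
        )
    return needed
```

For example, 51 workspaces in a region need at least two NCCs. In practice you may choose more NCCs than the minimum to separate environments or business units.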

Private Preview

Please note that, at the time of writing, this feature is in Private Preview. Please work with your Databricks account team to enable it and get access to the documentation.

Summary

In summary, securely connecting your private resources to Databricks Serverless using Private Link provides a robust solution for maintaining data privacy and integrity while leveraging the flexibility and performance of serverless computing. By utilizing Private Link, you can ensure that all communication between your serverless compute and private resources occurs within your private network, reducing exposure to the public internet. This first part of the series has highlighted the architecture, common use cases, and a detailed Kafka integration example. As you implement these patterns, consider the trade-offs in cost and configuration complexity, and stay tuned for the next part of this series, where we will delve deeper into additional configurations and advanced use cases.

In part two of the blog series we will examine strategies for scaling your Private Link-enabled connectivity while minimizing operational and management overhead.

Private and Dedicated Connectivity Patterns for Databricks Serverless Using Private Link - Part 1
