Sai_Ponugoti
Databricks Employee

As organizations increasingly adopt multi-cloud strategies to leverage the unique strengths of various cloud platforms, they face the dual challenge of maintaining robust security while enabling efficient data sharing. Balancing accessibility with protection is crucial, especially when managing sensitive information across diverse cloud environments.

Many organizations choose to restrict public access to their cloud storage to safeguard data from potential threats. However, this creates complexities when sharing data securely between clouds without compromising security standards. Delta Sharing, an open protocol for secure data sharing, provides a solution by facilitating seamless, replication-free data exchange.

Despite its advantages, Delta Sharing requires that the recipient be able to establish a network connection to both the Delta Sharing server and the underlying storage, which can present challenges across wide-area networks such as the internet.

At Databricks, we previously proposed an architecture to enable private cross-cloud Delta Sharing, using a Virtual Private Network (VPN) to access the data provider's storage. This approach allows for a no-public-access policy while ensuring that any traffic routed over the public network is protected by the additional encryption and other safeguards a VPN provides (many organizations already use VPNs to secure their external traffic). More details can be found here. The main limitation of this approach is that it works only with Databricks classic compute.

In this blog, we extend the previous architecture to show how secure cross-cloud Delta Sharing can be achieved using serverless compute. We will demonstrate the setup with AWS Databricks serverless compute as the consumer and Azure Databricks as the producer.

Architecture Diagram

The architecture below shows the VPN set-up between Azure (the provider) and AWS (the recipient), as well as the use of AWS PrivateLink to establish an NCC connection between an AWS VPC endpoint and the Azure ADLS service.

 

Serverless Private Delta Share Architecture

To control access to the internet from your serverless workloads, Databricks is enhancing egress control capabilities. Please contact your account team to learn more.

To establish a dedicated connection through AWS Serverless Compute to Azure Storage, two critical connections are required: the connection between the Serverless compute plane and the customer's VPC in AWS, and the connection from the customer's VPC to Azure ADLS.

In Databricks, serverless compute plane networking is managed by Network Connectivity Configuration (NCC) objects.
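If you prefer scripting these objects rather than clicking through the account console, the Databricks SDK for Python exposes the NCC API at the account level. Below is a minimal sketch that lists existing NCC objects; the host, account ID, and authentication are placeholders, and the method names assume a recent databricks-sdk release, so verify against your installed version.

```python
# A minimal sketch using the Databricks SDK for Python (pip install databricks-sdk).
# Host, account ID, and auth are placeholders; configure real credentials first.
from databricks.sdk import AccountClient

a = AccountClient(
    host="https://accounts.cloud.databricks.com",
    account_id="<your-account-id>",
)

# List the NCC objects already defined in the account
for ncc in a.network_connectivity.list_network_connectivity_configurations():
    print(ncc.name, ncc.region, ncc.network_connectivity_config_id)
```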


For optimal security, we recommend leveraging AWS PrivateLink for serverless, which is the method used in our example setup.

As previously mentioned, we use a site-to-site VPN connection to establish cross-cloud connectivity for Delta Sharing. This solution assumes that you already have such a connection established between Azure and AWS.

Prerequisites

Before we proceed with the implementation, it is important to review the prerequisites for a successful deployment.

  1. This blog assumes that a connection already exists between AWS and Azure; for this use case we implemented a site-to-site VPN connection, and the steps for establishing it can be found in the blog mentioned above.
  2. You must be using the Databricks Enterprise tier.
  3. You must have account admin privileges to configure the Network Connectivity Configuration objects.
  4. You must have at least one functional workspace using serverless compute on AWS.
  5. You must have enabled AWS PrivateLink (Private Preview) in your account console.

Implementation Walkthrough

This section describes the detailed steps to configure a dedicated, secured connection from the serverless compute plane to the Azure ADLS service for your workspace.

Step 1: Use or Create a Private Azure Data Lake Storage (ADLS)

For this step, you can either create a new ADLS account using the documentation here or use an existing one. To ensure the storage account is private, disable public access on the storage account and create dedicated private endpoints between the workspace VNet and the storage account.

Private ADLS with private endpoints

⚠️NOTE: Once the Azure ADLS account is available, note down the private IP of the private endpoint you have just created. This will be used later when configuring the target group.

Private IP of the ADLS
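If you want to confirm this private IP without opening the Azure portal, resolving the storage account's FQDN from a host inside the VNet should return the private endpoint address. A quick sketch, assuming the privatelink private DNS zone is linked to the VNet and using the example account name from this walkthrough:

```python
# Run from a host inside the VNet; assumes the privatelink.dfs.core.windows.net
# private DNS zone is linked, so the FQDN resolves to the private endpoint IP.
import socket

fqdn = "demoaonsiddharthponugoti.dfs.core.windows.net"
print(socket.gethostbyname(fqdn))  # e.g. 10.x.x.x - note this down for Step 2
```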

Step 2 - Create all the NLB resources required for the VPC endpoint service in your AWS account

We need to create a Network Load Balancer (NLB) that routes to the Azure ADLS private endpoint and a VPC endpoint service that uses the NLB. I referred to the following AWS documentation for both steps.

  • Create NLB target group

    Step-a (Specify group details)
    In this step, select IP addresses as the target type, select TCP as the protocol, keep the health checks as TCP, and select the VPC where Databricks resides.


Step-b (Register targets)
In this step, enter the IP address of the ADLS private endpoint that you created in Step 1.

ADLS Private IP

This is the final result:

Target Groups

  •  Create NLB


To create an NLB, navigate to the Load Balancers section under EC2 and create a Network Load Balancer with the following configuration:

  1. Scheme: Internal
  2. VPC: Select the VPC where Databricks is deployed
  3. Listeners and routing: TCP port 443, forwarding to the target group you created earlier

This is the final result:

Once the NLB has been created and the target groups are registered, you can go to the Resource map section to check the status of your connection.

Resource Map
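For readers who prefer automating this step, the boto3 sketch below creates the same resources described above. The VPC ID, subnet IDs, and the ADLS private endpoint IP are placeholders for your own values, and port 443 assumes the default HTTPS port that ADLS serves on.

```python
# A boto3 sketch of Step 2 (pip install boto3). IDs and IPs are placeholders.
import boto3

elbv2 = boto3.client("elbv2", region_name="eu-west-1")

# Step-a: target group with target type "ip" and TCP health checks,
# created in the VPC where Databricks is deployed
tg = elbv2.create_target_group(
    Name="adls-private-endpoint-tg",
    Protocol="TCP",
    Port=443,
    VpcId="vpc-<databricks-vpc-id>",
    TargetType="ip",
    HealthCheckProtocol="TCP",
)
tg_arn = tg["TargetGroups"][0]["TargetGroupArn"]

# Step-b: register the private IP of the ADLS private endpoint from Step 1
elbv2.register_targets(
    TargetGroupArn=tg_arn,
    Targets=[{"Id": "10.0.1.25", "Port": 443}],  # ADLS private endpoint IP
)

# Internal NLB in the same VPC, with a TCP:443 listener forwarding to the group
nlb = elbv2.create_load_balancer(
    Name="adls-delta-sharing-nlb",
    Type="network",
    Scheme="internal",
    Subnets=["subnet-<a>", "subnet-<b>"],
)
nlb_arn = nlb["LoadBalancers"][0]["LoadBalancerArn"]

elbv2.create_listener(
    LoadBalancerArn=nlb_arn,
    Protocol="TCP",
    Port=443,
    DefaultActions=[{"Type": "forward", "TargetGroupArn": tg_arn}],
)
```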

Step 3 - Create a VPC endpoint service

VPC Endpoint Service

  • When creating the VPC endpoint service, select the load balancer type as “Network” and select the NLB you created in Step 2.

  • Check the “Acceptance required” box so that endpoint connection requests must be manually accepted.
  • After your VPC endpoint service is created, please allow the Databricks Serverless stable IAM role in the “Allow principals” tab. This will allow Databricks to create one VPC endpoint to link to your VPC endpoint service.
    The Databricks serverless stable IAM role has the format
    arn:aws:iam::565502421330:role/private-connectivity-role-<region>
    For example, if your VPC endpoint service is in region eu-west-1, allowlist
    arn:aws:iam::565502421330:role/private-connectivity-role-eu-west-1

     

    Alternatively, you could allowlist '*', since the network security of your VPC endpoint service is still guaranteed by manually accepting only the VPC endpoint that Databricks creates for it.

    IMPORTANT: Ensure that “Enforce inbound rules on PrivateLink traffic” is not selected.
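This step can also be scripted with boto3. The sketch below creates the endpoint service from the NLB built in Step 2 and allowlists the Databricks serverless stable IAM role; the NLB ARN is a placeholder for your own value.

```python
# A boto3 sketch of Step 3; replace the NLB ARN with the one from Step 2.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# Create the VPC endpoint service backed by the NLB, requiring manual
# acceptance of incoming endpoint connection requests
svc = ec2.create_vpc_endpoint_service_configuration(
    NetworkLoadBalancerArns=["<nlb-arn-from-step-2>"],
    AcceptanceRequired=True,
)
service_id = svc["ServiceConfiguration"]["ServiceId"]
print(svc["ServiceConfiguration"]["ServiceName"])  # needed again in Step 5

# Allowlist the Databricks serverless stable IAM role for this region
ec2.modify_vpc_endpoint_service_permissions(
    ServiceId=service_id,
    AddAllowedPrincipals=[
        "arn:aws:iam::565502421330:role/private-connectivity-role-eu-west-1"
    ],
)
```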

Step 4 - In the Databricks Account console, create a Network Connectivity Config (NCC) object

⚠️NOTE: You can skip this step if there is an existing NCC object and a Network Policy (with restricted access) object that you wish to use for your workspace.

Log in as a Databricks account admin. On the left pane of the account console, navigate to Cloud resources → Network → Network Connectivity Configurations, click “Add Network Connectivity Configuration,” enter the NCC name and region, and click “Add” to create the NCC.

Please make sure the NCC you create resides in the same region as your AWS VPC.
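If you are scripting the setup, the equivalent call with the Databricks SDK for Python looks roughly like the sketch below. The method name assumes a recent databricks-sdk release, and the host, account ID, and NCC name are placeholders; verify against the SDK version you have installed.

```python
# A sketch with the Databricks SDK for Python; verify method names against
# your installed databricks-sdk version before relying on this.
from databricks.sdk import AccountClient

a = AccountClient(
    host="https://accounts.cloud.databricks.com",
    account_id="<your-account-id>",
)

# The NCC must live in the same region as the AWS VPC from Step 2
ncc = a.network_connectivity.create_network_connectivity_configuration(
    name="serverless-private-delta-share-ncc",
    region="eu-west-1",
)
print(ncc.network_connectivity_config_id)  # needed in Steps 5 and 7
```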

Step 5 - Create a private endpoint rule in the NCC

Please contact your account team to be enrolled in the PrivateLink preview.

Select the NCC you created in Step 4, navigate to Private endpoint rules, click “Add private endpoint rule”, enter Endpoint service and Domain names, and click “Add” to create the private endpoint rule in the NCC.

The Endpoint service is the service name of the VPC endpoint service that you created in Step 3. In our example setup, it is

com.amazonaws.vpce.eu-west-1.vpce-svc-02112aa532dc17d2a

The Domain name is the FQDN of the destination resource. In our case, it is

demoaonsiddharthponugoti.dfs.core.windows.net

which is the Azure ADLS account you created in Step 1.

⚠️NOTE: You should not include the "https://" prefix of the endpoint.

The private endpoint rule should now show the status PENDING.

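The UI step above can also be performed against the account-level REST API. Because AWS PrivateLink for serverless is in private preview at the time of writing, the request fields below ("endpoint_service", "domain_names") are assumptions based on the preview rather than a documented contract; confirm the exact API shape with your account team.

```python
# A hypothetical REST sketch of this step; field names are assumptions,
# since this capability is in private preview at the time of writing.
import requests

account_id = "<your-account-id>"
ncc_id = "<ncc-id-from-step-4>"
url = (
    f"https://accounts.cloud.databricks.com/api/2.0/accounts/{account_id}"
    f"/network-connectivity-configs/{ncc_id}/private-endpoint-rules"
)

resp = requests.post(
    url,
    headers={"Authorization": "Bearer <account-api-token>"},
    json={
        "endpoint_service": "com.amazonaws.vpce.eu-west-1.vpce-svc-02112aa532dc17d2a",
        "domain_names": ["demoaonsiddharthponugoti.dfs.core.windows.net"],
    },
)
resp.raise_for_status()
print(resp.json())  # the new rule should report a PENDING state
```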

Step 6 - Approve the VPC endpoint connection request on the VPC endpoint service in your AWS account.

Go to the VPC endpoint service you created in Step 3, navigate to Endpoint connections, confirm that the endpoint ID matches the VPC endpoint Databricks created in Step 5, open the Actions drop-down menu, select Accept endpoint connection request, and click the Accept button in the pop-up window to approve the connection request.

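This acceptance can likewise be automated with boto3; the service and endpoint IDs below are placeholders for the values shown in your console.

```python
# A boto3 sketch of the acceptance; IDs are placeholders for the values
# shown on your VPC endpoint service's Endpoint connections tab.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

ec2.accept_vpc_endpoint_connections(
    ServiceId="vpce-svc-02112aa532dc17d2a",   # service from Step 3
    VpcEndpointIds=["vpce-<endpoint-id>"],    # endpoint created by Databricks
)
```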

Now go back to the Private endpoint rules page in the Databricks account console, wait a minute, refresh the page, and the private endpoint rule status should change to ESTABLISHED.


Step 7 - Attach the NCC object to your workspace in the Databricks account console

In the Databricks account console, navigate to “Workspaces” on the left pane, select an existing workspace, click Update workspace to open the Update workspace page, open the Network Connectivity Configuration drop-down menu, select the NCC you created in Step 4, and click the Update button to attach the NCC object to the workspace.

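Scripted, the attachment looks roughly like the sketch below with the Databricks SDK for Python. The workspace ID is a placeholder, and the parameter name assumes a recent databricks-sdk release; verify against your installed version.

```python
# A sketch with the Databricks SDK for Python; the workspace ID is a
# placeholder, and the parameter name assumes a recent SDK release.
from databricks.sdk import AccountClient

a = AccountClient(
    host="https://accounts.cloud.databricks.com",
    account_id="<your-account-id>",
)

a.workspaces.update(
    workspace_id=1234567890123456,
    network_connectivity_config_id="<ncc-id-from-step-4>",
)
```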

Step 8 - Configure the provider and recipient for Delta Sharing

Now that you have configured the NCC and attached it to the workspace, you can test it by creating a Delta Share. Follow the Databricks documentation to set up Delta Sharing on your account for the provider side, and this documentation to create and manage shares for Delta Sharing.

Azure Side: 

For this example, I will be using a Delta table called customers_100, which holds details of customers of a fictitious company. I have also created a share called deltashare-azure and added the table as an asset.


After creating the Delta Share, I requested the recipient’s sharing identifier and created the recipient.

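The provider-side objects can equivalently be created with Databricks SQL from a notebook in the Azure workspace. In this sketch, the catalog and schema (main.default) and the recipient sharing identifier are illustrative assumptions; substitute your own values.

```python
# Provider side (Azure workspace). A sketch in Databricks SQL via spark.sql;
# main.default.customers_100 and the recipient ID are illustrative placeholders.
spark.sql("CREATE SHARE IF NOT EXISTS `deltashare-azure`")
spark.sql("ALTER SHARE `deltashare-azure` ADD TABLE main.default.customers_100")

# Databricks-to-Databricks sharing: create the recipient from the sharing
# identifier requested from the AWS workspace (format cloud:region:uuid)
spark.sql("""
    CREATE RECIPIENT IF NOT EXISTS aws_recipient
    USING ID 'aws:eu-west-1:<recipient-sharing-identifier>'
""")
spark.sql("GRANT SELECT ON SHARE `deltashare-azure` TO RECIPIENT aws_recipient")
```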

AWS Side:

Now, on the AWS side, in the Catalog section, select Delta Sharing and navigate to the share sent by your provider.

You should see the Create catalog option; select it to create a new catalog.


The result is a new catalog backed by the Delta Share from Azure.

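On the recipient side, mounting the share as a catalog and querying it can also be done in SQL from a serverless notebook. The provider name, catalog name, and schema in this sketch are placeholders; the provider name is shown under Catalog > Delta Sharing in your workspace.

```python
# Recipient side (AWS workspace, serverless compute). Provider, catalog, and
# schema names are placeholders for the values in your own workspace.
spark.sql("""
    CREATE CATALOG IF NOT EXISTS deltashare_from_azure
    USING SHARE `<provider-name>`.`deltashare-azure`
""")

# Query the shared table through the new catalog
df = spark.sql(
    "SELECT * FROM deltashare_from_azure.default.customers_100 LIMIT 10")
df.show()
```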

You have now successfully established a Delta Share connection, enabling you to seamlessly query your files stored in Azure ADLS from AWS Databricks using serverless compute on the recipient’s side.

⚠️NOTE: Please make sure you are using serverless compute and not classic compute for testing.

Private cross-cloud Delta Share using serverless compute

To share additional files, simply add them to the Delta Share configured earlier, enabling you to query them on AWS Databricks. Similarly, you can easily remove files from the share if needed.

By following the steps above, you can now use serverless compute for private cross-cloud Delta Sharing on Databricks.

Use case cost:

In our previous blog, we provided an example cost analysis estimating the incremental cost of sharing a 1 GB table stored in S3 for one month via the private Delta Sharing solution described there, over an Azure-to-AWS VPN connection: an overall cost of $80.20 as of March 15, 2024.

Our current architecture leverages a combination of VPC Endpoints, AWS PrivateLink, Network Load Balancers with target groups, and serverless computing. This design ensures a high level of efficiency, security, and scalability, while the inclusion of these robust components contributes only marginally to the overall infrastructure cost.

Conclusion:

In this article, we have shown how organizations can implement private cross-cloud Delta Sharing using serverless compute on Databricks. This approach helps organizations share data between clouds securely without incurring significant costs.
