As organizations increasingly adopt multi-cloud strategies to leverage the unique strengths of various cloud platforms, they face the dual challenge of maintaining robust security while enabling efficient data sharing. Balancing accessibility with protection is crucial, especially when managing sensitive information across diverse cloud environments.
Many organizations choose to restrict public access to their cloud storage to safeguard data from potential threats. However, this creates complexities when sharing data securely between clouds without compromising security standards. Delta Sharing, an open protocol for secure data sharing, provides a solution by facilitating seamless, replication-free data exchange.
Despite its advantages, Delta Sharing requires that the recipient be able to establish a network connection to both the Delta Sharing server and the underlying storage, which can present challenges across wide-area networks like the internet.
At Databricks, we previously proposed an architecture to enable private cross-cloud Delta Sharing, using a Virtual Private Network (VPN) to access data provider storage. This approach allows for a no-public-access policy while ensuring that any traffic routed over the public network is protected with the additional encryption and other protections provided by a VPN (which many organizations already use to secure their external traffic). More details can be found here. The main limitation of this approach is that it only works for Databricks classic compute.
In this blog, we extend the previous architecture to show how secure cross-cloud Delta Sharing can be done using serverless compute. We will demonstrate the setup with AWS Databricks serverless as the consumer and Azure Databricks as the producer.
The architecture below shows the VPN set-up between Azure (as the provider) and AWS (as the recipient), as well as the use of AWS PrivateLink to establish an NCC connection between the AWS VPC endpoint and the Azure ADLS service.
Serverless Private Delta Share Architecture
To control access to the internet from your serverless workloads, Databricks is enhancing egress control capabilities. Please contact your account team to learn more.
To establish a dedicated connection from AWS serverless compute to Azure Storage, two critical connections are required: the connection between the serverless compute plane and the customer's VPC in AWS, and the connection from the customer's VPC to Azure ADLS.
In Databricks, serverless compute plane networking is managed by Network Connectivity Configuration (NCC) objects.
For optimal security, we recommend leveraging AWS PrivateLink for serverless, which is the method used in our example setup.
As previously mentioned, we use a site-to-site VPN connection to establish cross-cloud connectivity for Delta Sharing. This solution assumes that you already have a cross-cloud connection established between Azure and AWS.
Before we proceed with the implementation, it is important to know the prerequisites needed to ensure a successful deployment.
This section describes the detailed steps to configure a dedicated and secured connection from your serverless compute plane to the Azure ADLS service for your workspace.
For this step, you can either create a new ADLS account using the documentation here or use an existing one. To ensure the storage account is private, disable public access on the storage account and create dedicated private endpoints between the workspace VNet and the storage account.
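If you prefer to script this hardening step, here is a minimal sketch using the Azure SDK for Python; the subscription, resource group, and account names are placeholders, and the private endpoints themselves are created separately (for example, in the portal or with azure-mgmt-network):

```python
# Minimal sketch: disable public network access on an existing ADLS Gen2
# account so that it is reachable only through private endpoints.
# Assumes azure-identity and azure-mgmt-storage are installed.
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

client.storage_accounts.update(
    resource_group_name="<resource-group>",   # placeholder
    account_name="<storage-account>",         # placeholder
    parameters={"public_network_access": "Disabled"},
)
```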
Private ADLS with private endpoints
⚠️NOTE: Once the Azure ADLS account is available, note down the private IP of the private endpoint you have just created. This will be used later when configuring the target group.
Private IP of the ADLS
We need to create a Network Load Balancer (NLB) that routes to the Azure ADLS private endpoint, and then create a VPC endpoint service using the NLB.
I referred to the following AWS documentation to create the Network Load Balancer (NLB) and the VPC endpoint service.
Step b (Register Targets)
In this section, you need to enter the IP address of the ADLS private endpoint that you created in Step 1; a scripted sketch of this step follows the screenshots below.
ADLS Private IP
This is the final result:
Target Groups
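If you script this step instead of clicking through the console, a minimal boto3 sketch follows; the target group name, VPC ID, and IP address are placeholders, not values from the original setup:

```python
# Sketch: create an IP-based target group on TCP:443 and register the
# private IP of the ADLS private endpoint noted in Step 1.
import boto3

elbv2 = boto3.client("elbv2", region_name="eu-west-1")

tg = elbv2.create_target_group(
    Name="adls-private-endpoint-tg",  # hypothetical name
    Protocol="TCP",
    Port=443,                         # ADLS is reached over HTTPS
    VpcId="<vpc-id>",
    TargetType="ip",
)
tg_arn = tg["TargetGroups"][0]["TargetGroupArn"]

# The ADLS private endpoint IP lives outside the VPC (across the VPN),
# so the AvailabilityZone must be set to "all".
elbv2.register_targets(
    TargetGroupArn=tg_arn,
    Targets=[{"Id": "<adls-private-endpoint-ip>", "Port": 443,
              "AvailabilityZone": "all"}],
)
```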
To create an NLB, navigate to the Load Balancers section under EC2 and create a Network Load Balancer with the following configuration (a scripted sketch follows below):
This is the final result:
Once the NLB has been created and the target groups are specified, you can go to the Resource map section to check the status of your connection.
Resource Map
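The equivalent boto3 sketch for this step, assuming the subnets and the target group ARN from your own environment:

```python
# Sketch: create an internal NLB and a TCP:443 listener that forwards
# to the target group registered above.
import boto3

elbv2 = boto3.client("elbv2", region_name="eu-west-1")

nlb = elbv2.create_load_balancer(
    Name="adls-private-nlb",   # hypothetical name
    Type="network",
    Scheme="internal",         # no public exposure
    Subnets=["<subnet-a>", "<subnet-b>"],
)
nlb_arn = nlb["LoadBalancers"][0]["LoadBalancerArn"]

elbv2.create_listener(
    LoadBalancerArn=nlb_arn,
    Protocol="TCP",
    Port=443,
    DefaultActions=[{"Type": "forward", "TargetGroupArn": "<target-group-arn>"}],
)
```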
VPC Endpoint Service
When creating the VPC endpoint service, select the load balancer type as “Network” and select the NLB you created in Step 2.
Allowlist the following Databricks principal on the endpoint service, substituting your region:
arn:aws:iam::565502421330:role/private-connectivity-role-<region>
For example, if your VPC endpoint service is in region eu-west-1, allowlist arn:aws:iam::565502421330:role/private-connectivity-role-eu-west-1
Alternatively, you could allowlist '*', since the network security of your VPC endpoint service is still guaranteed by manually accepting only the VPC endpoint that Databricks creates for your VPC endpoint service.
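Scripted, this step looks roughly as follows; the NLB ARN is a placeholder and the region is the one from our example:

```python
# Sketch: expose the NLB as a VPC endpoint service and allowlist the
# regional Databricks serverless principal.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

svc = ec2.create_vpc_endpoint_service_configuration(
    NetworkLoadBalancerArns=["<nlb-arn>"],
    AcceptanceRequired=True,  # we will accept the Databricks endpoint manually
)
service_id = svc["ServiceConfiguration"]["ServiceId"]

ec2.modify_vpc_endpoint_service_permissions(
    ServiceId=service_id,
    AddAllowedPrincipals=[
        "arn:aws:iam::565502421330:role/private-connectivity-role-eu-west-1"
    ],
)
```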
⚠️NOTE: You can skip this step if there is an existing NCC object and a Network Policy (with restricted access) object that you wish to use for your workspace.
Please contact your account team to be enrolled in the PrivateLink preview. Once enrolled, you can:
Log in as a Databricks admin. On the left pane of the Accounts console, navigate to Cloud resources → Network → Network Connectivity Configurations, click “Add Network Connectivity Configuration,” enter the NCC name and region, and click “Add” to create the NCC.
Please make sure the NCC you create resides in the same region as your AWS VPC.
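If you prefer the API over the console, here is a minimal sketch with the Databricks SDK for Python (databricks-sdk); the account ID and the NCC name are placeholders:

```python
# Sketch: create the NCC programmatically. The region must match your AWS VPC.
from databricks.sdk import AccountClient

a = AccountClient(
    host="https://accounts.cloud.databricks.com",
    account_id="<databricks-account-id>",  # placeholder
)

ncc = a.network_connectivity.create_network_connectivity_configuration(
    name="serverless-private-ncc",  # hypothetical name
    region="eu-west-1",             # same region as the AWS VPC
)
print(ncc.network_connectivity_config_id)
```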
Select the NCC you created in Step 4, navigate to Private endpoint rules, click “Add private endpoint rule”, enter the Endpoint service and Domain names, and click “Add” to create the private endpoint rule in the NCC.
The Endpoint service is the service name of the VPC endpoint service that you created in Step 3. In our example setup, it is
com.amazonaws.vpce.eu-west-1.vpce-svc-02112aa532dc17d2a.
The Domain name is the FQDN of the destination resource. In our case, it is
demoaonsiddharthponugoti.dfs.core.windows.net,
which is the Azure ADLS account you created in Step 1.
⚠️NOTE: You should not include the "https://" prefix in the endpoint domain name.
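This step can also be scripted against the Account REST API. The JSON field names below mirror the UI labels (“Endpoint service”, “Domain names”) and are assumptions; check the current NCC API reference for the exact schema:

```python
# Sketch: add the private endpoint rule to the NCC over REST.
import requests

ACCOUNT_ID = "<databricks-account-id>"       # placeholder
NCC_ID = "<network-connectivity-config-id>"  # placeholder

resp = requests.post(
    f"https://accounts.cloud.databricks.com/api/2.0/accounts/{ACCOUNT_ID}"
    f"/network-connectivity-configs/{NCC_ID}/private-endpoint-rules",
    headers={"Authorization": "Bearer <token>"},
    json={
        # Field names assumed from the UI labels; verify before use.
        "endpoint_service": "com.amazonaws.vpce.eu-west-1.vpce-svc-02112aa532dc17d2a",
        "domain_names": ["demoaonsiddharthponugoti.dfs.core.windows.net"],
    },
)
resp.raise_for_status()
```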
Now the private endpoint rule should show the status PENDING.
Go to the VPC endpoint service you created in Step 3, navigate to Endpoint connections, confirm the Endpoint ID matches the VPC endpoint that you created in Step 5, open the Actions drop-down menu, select Accept endpoint connection request, and click the Accept button in the pop-up window to approve the connection request.
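Alternatively, you can accept the pending connection with boto3; the service ID is a placeholder:

```python
# Sketch: find and accept pending endpoint connection requests against
# your VPC endpoint service.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

conns = ec2.describe_vpc_endpoint_connections(
    Filters=[{"Name": "service-id", "Values": ["<service-id>"]}]
)
pending = [
    c["VpcEndpointId"]
    for c in conns["VpcEndpointConnections"]
    if c["VpcEndpointState"] == "pendingAcceptance"
]

ec2.accept_vpc_endpoint_connections(ServiceId="<service-id>", VpcEndpointIds=pending)
```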
Now go back to the Private endpoint rules page in the Databricks Accounts console, wait a minute, refresh the page, and the private endpoint rule status should change to ESTABLISHED.
In the Databricks Accounts console, navigate to “Workspaces” on the left pane, select an existing workspace, click Update workspace to open the Update workspace page, open the Network Connectivity Configuration drop-down menu, select the NCC you created in Step 4, and click the Update button to attach the NCC object to the workspace.
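The attachment can also be done over the account API; the field name below is taken from the documented workspace-update endpoint, but treat it as an assumption and verify against the current API reference:

```python
# Sketch: attach the NCC to a workspace via a PATCH on the workspace.
import requests

resp = requests.patch(
    "https://accounts.cloud.databricks.com/api/2.0/accounts/<account-id>"
    "/workspaces/<workspace-id>",
    headers={"Authorization": "Bearer <token>"},
    json={"network_connectivity_config_id": "<network-connectivity-config-id>"},
)
resp.raise_for_status()
```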
Now that you have configured the NCC and attached it to the workspace, you can test it by creating a Delta Share. Follow the Databricks documentation to set up Delta Sharing on your account for the provider side, and this documentation to create and manage shares for Delta Sharing.
For this example, I am using a Delta table called customers_100, which holds details of customers of a fictitious company. I have also created a share called deltashare-azure and added the table as an asset.
After creating the Delta Share, I requested the recipient’s sharing identifier and created the recipient.
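Scripted end to end, the provider-side setup looks roughly like this; the catalog and schema holding customers_100 and the recipient name are placeholders:

```python
# Provider side (Azure Databricks). Uses the share name from this example.
spark.sql("CREATE SHARE IF NOT EXISTS `deltashare-azure`")
spark.sql("ALTER SHARE `deltashare-azure` ADD TABLE <catalog>.<schema>.customers_100")

# Databricks-to-Databricks sharing: create the recipient from the sharing
# identifier requested from the consumer, then grant access to the share.
spark.sql(
    "CREATE RECIPIENT IF NOT EXISTS aws_recipient "
    "USING ID '<recipient-sharing-identifier>'"
)
spark.sql("GRANT SELECT ON SHARE `deltashare-azure` TO RECIPIENT aws_recipient")
```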
Now, on the AWS side, in the Catalog section, select Delta Sharing and then navigate to the share sent by your provider.
You should see the Create catalog option; select it to create a new catalog.
This is the catalog created using deltashare from Azure:
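Equivalently, the catalog can be created from the share with SQL; the provider name shown in the Delta Sharing UI is a placeholder:

```python
# Recipient side (AWS Databricks): mount the share as a catalog.
spark.sql(
    "CREATE CATALOG IF NOT EXISTS deltashare_azure_catalog "
    "USING SHARE `<provider-name>`.`deltashare-azure`"
)
```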
You have now successfully established a Delta Share connection, enabling you to seamlessly query your files stored in Azure ADLS from AWS Databricks using serverless compute on the recipient’s side.
⚠️NOTE: Please make sure you are using Serverless compute and not classic compute for testing.
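For example, from a serverless notebook on the AWS workspace (the schema name is a placeholder):

```python
# Query the shared table through the catalog created above.
display(
    spark.sql(
        "SELECT * FROM deltashare_azure_catalog.<schema>.customers_100 LIMIT 10"
    )
)
```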
Private cross cloud delta share using serverless compute
To share additional files, simply add them to the Delta Share configured earlier, enabling you to query them on AWS Databricks. Similarly, you can easily remove files from the share if needed.
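For instance, assuming a hypothetical orders table on the provider side:

```python
# Add a table to (or remove it from) the existing share.
spark.sql("ALTER SHARE `deltashare-azure` ADD TABLE <catalog>.<schema>.orders")
spark.sql("ALTER SHARE `deltashare-azure` REMOVE TABLE <catalog>.<schema>.orders")
```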
By following the steps above, you can now use serverless compute for private cross-cloud Delta Sharing on Databricks.
In our previous blog, we provided an example cost estimate for sharing a 1 GB table stored in S3 for one month via the Private Delta Sharing solution described there, over an Azure-to-AWS VPN connection: an overall cost of $80.20 as of March 15, 2024.
Our current architecture leverages a combination of VPC Endpoints, AWS PrivateLink, Network Load Balancers with target groups, and serverless computing. This design ensures a high level of efficiency, security, and scalability, while the inclusion of these robust components contributes only marginally to the overall infrastructure cost.
In this article, we have shown how organizations can create private cross-cloud delta sharing using Serverless Compute on Databricks. The approach mentioned above helps organizations share data between clouds securely without incurring significant costs.