Delta Sharing is an open protocol developed by Databricks for secure data sharing within an organization and externally, regardless of the computing platforms used. Depending on who you are sharing the data with, Delta Sharing can be used in two ways.
Databricks-to-Open (D2O) sharing, or simply Open sharing, lets you share data with any user, regardless of whether they have access to Databricks.
Databricks-to-Databricks (D2D) sharing lets you share data with Databricks users who are using Unity Catalog (UC) with a metastore that is different from yours.
As there is usually a single UC metastore per cloud region for a given Databricks account, all workspaces in that particular cloud region will link to the same UC metastore and can access data seamlessly using the built-in governance capabilities.
In a multi-region or multi-cloud setup, there is a need to share data between regions/clouds, for which there are three options:
In this article, we explore D2D Delta Sharing on Azure Databricks.
Delta sharing across cloud regions works as follows:
A recipient using Azure Databricks in region A requests access to a dataset/table shared with them by the provider also using Azure Databricks in region B.
UC verifies the request and returns pre-signed URLs to the recipient.
The recipient then fetches the data “directly” from the Storage Account using these pre-signed URLs.
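To make the last step concrete, here is a minimal sketch that pulls the storage endpoint out of a pre-signed URL. The URL, account name, and path are made-up placeholders, not real output from Unity Catalog. The only point is that the host the recipient contacts is the provider's storage account rather than Databricks itself, which is why the storage account's network rules matter.

```python
from urllib.parse import urlparse, parse_qs


def storage_host(presigned_url: str) -> str:
    """Return the host the recipient contacts to download a file."""
    return urlparse(presigned_url).hostname


def is_signed(presigned_url: str) -> bool:
    """Azure SAS-style URLs carry their signature in the `sig` query parameter."""
    return "sig" in parse_qs(urlparse(presigned_url).query)


# Hypothetical URL for illustration only (account, container, and file are invented).
example_url = (
    "https://providerstorage.dfs.core.windows.net/container/table/part-0000.parquet"
    "?sv=2022-11-02&se=2024-01-01T00%3A00%3A00Z&sig=REDACTED"
)

print(storage_host(example_url))
print(is_signed(example_url))
```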
As you might have already guessed, this only works in the following cases:
The Azure storage account is publicly accessible. In other words, ADLS Gen2 does not have any firewall restrictions in place.
The IP address or the CIDR range of the recipient(s) has been whitelisted on the provider’s storage account firewall.
The communication between the recipient and the provider’s ADLS Gen2 is private, and the provider-side ADLS Gen2 firewall allows for this.
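The three cases can be read as a simple decision rule. The sketch below is a deliberately simplified model, not how Azure evaluates its network rules; the `StorageFirewall` class, its fields, and the VNET names are hypothetical and exist only to illustrate the logic.

```python
import ipaddress
from dataclasses import dataclass, field


@dataclass
class StorageFirewall:
    """Simplified, hypothetical model of the provider's storage network rules."""
    public_access: bool = False                       # case 1: no firewall restrictions
    allowed_cidrs: list = field(default_factory=list)  # case 2: whitelisted IPs/CIDRs
    private_vnets: set = field(default_factory=set)    # case 3: VNETs with private access


def can_fetch(fw: StorageFirewall, source_ip=None, source_vnet=None) -> bool:
    """Return True if a request from the given source would be let through."""
    if fw.public_access:
        return True
    if source_ip is not None:
        ip = ipaddress.ip_address(source_ip)
        if any(ip in ipaddress.ip_network(cidr) for cidr in fw.allowed_cidrs):
            return True
    return source_vnet is not None and source_vnet in fw.private_vnets


# Case 3: the firewall is on, but the recipient's VNET has private access.
fw = StorageFirewall(private_vnets={"recipient-vnet"})
print(can_fetch(fw, source_vnet="recipient-vnet"))
print(can_fetch(fw, source_ip="198.51.100.7"))
```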
We will delve deeper into the third case above. Note that in the scenario we are interested in, D2D sharing, the Databricks workspaces do not have public IP addresses that could be whitelisted on the ADLS Gen2 firewall, so the second case does not apply.
Securely Accessing Storage Accounts
Because the Databricks data plane runs in VNETs, there are two ways to access storage accounts securely from a Databricks workspace.
The recipient-side Databricks data plane VNET (public/host subnet) can be added to the provider-side ADLS Gen2 network configuration. This can be done using a global/cross-region service endpoint. Until April 2023, Service Endpoints allowed secure storage account access from VNETs only within the same cloud region. Cross-region service endpoints for Azure Storage became generally available in April 2023, so this option now works across regions as well.
Azure Private Link is the most secure way to access Azure data services from Azure Databricks. Although Service Endpoints and Private Endpoints both route the traffic between your virtual network and the storage account over the Microsoft network backbone, with a Service Endpoint the storage account is still reached at a publicly routable IP address, whereas a Private Endpoint is a private IP in the address space of the virtual network where the Private Endpoint is configured.
Cross-Region Secure Data Access using a private endpoint
Note: Since the storage account firewall is on, the recipient fails to access the share (the Databricks workspace tries to fetch the files directly from the storage account).
3. Create a private endpoint for the (provider) storage account in the (recipient) Databricks workspace VNET in the other region.
The following configuration should be used for the setup:
Region: The region of the recipient Databricks workspace
Virtual Network: The VNET where the recipient Databricks workspace data plane is deployed
Subnet: One of the subnets in the recipient Databricks workspace VNET
Target sub-resource: dfs
Integrate with private DNS zone
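As a rough sketch, the settings above map onto an ARM-style resource definition along these lines. The helper name, the resource names, and the IDs are placeholders, the `apiVersion` is omitted, and private DNS zone integration is configured separately and is not shown here, so treat this as an illustration of the shape rather than a deployable template.

```python
import json


def private_endpoint_resource(name, location, subnet_id, storage_account_id):
    """Build an ARM-style definition for a private endpoint to ADLS Gen2 (sketch)."""
    return {
        "type": "Microsoft.Network/privateEndpoints",
        "name": name,
        # Region of the recipient Databricks workspace, not of the storage account.
        "location": location,
        "properties": {
            # A subnet in the recipient Databricks workspace VNET.
            "subnet": {"id": subnet_id},
            "privateLinkServiceConnections": [
                {
                    "name": f"{name}-connection",
                    "properties": {
                        # Resource ID of the provider's storage account.
                        "privateLinkServiceId": storage_account_id,
                        # Target sub-resource: dfs
                        "groupIds": ["dfs"],
                    },
                }
            ],
        },
    }


# All values below are hypothetical placeholders.
resource = private_endpoint_resource(
    name="pe-provider-adls",
    location="westeurope",
    subnet_id="<recipient-subnet-resource-id>",
    storage_account_id="<provider-storage-account-resource-id>",
)
print(json.dumps(resource, indent=2))
```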
4. Test that the recipient can now access the data shared by the provider.
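One quick sanity check, in addition to querying the shared table, is to confirm that the storage endpoint resolves to a private IP from inside the recipient VNET. The sketch below assumes it is run from a notebook or machine in that VNET; the hostname is a placeholder, and the resolver is injectable so the logic can be exercised without network access.

```python
import ipaddress
import socket


def resolves_privately(hostname, resolver=socket.gethostbyname):
    """Return True if the hostname resolves to a private IP address."""
    ip = ipaddress.ip_address(resolver(hostname))
    return ip.is_private


# Stubbed resolvers stand in for DNS here; the addresses are invented.
host = "providerstorage.dfs.core.windows.net"  # hypothetical storage account
print(resolves_privately(host, resolver=lambda h: "10.0.1.5"))   # private endpoint in effect
print(resolves_privately(host, resolver=lambda h: "20.60.1.1"))  # public endpoint
```

Inside the recipient workspace, calling `resolves_privately(host)` without the stub uses the VNET's real DNS resolution. A `False` result would suggest that the private DNS zone is not linked to the VNET.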
Cross-Region Data Access from more than one Databricks workspace
If more than one Databricks workspace in the recipient region needs to access the same storage account, and these workspaces sit in isolated VNETs, we can either create a separate private endpoint for each VNET or peer the VNETs and use a single private endpoint. With peering, each VNET also needs a virtual network link to the private DNS zone that was created during the setup of the private endpoint.
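The two options can be compared by listing what each one requires. The sketch below assumes a hub-and-spoke layout for the peered option, with the first VNET hosting the private endpoint and already linked to the private DNS zone; the function, the layout, and the VNET names are illustrative assumptions rather than a prescribed design.

```python
def plan_private_access(vnets, peer_vnets=False):
    """List the resources needed for each option (illustrative sketch only)."""
    vnets = list(vnets)
    if not vnets:
        raise ValueError("at least one VNET is required")
    if not peer_vnets:
        # Option 1: one private endpoint per isolated VNET, no peering.
        return {"private_endpoints": vnets, "peerings": [], "extra_dns_zone_links": []}
    # Option 2 (assumed hub-and-spoke): a single private endpoint in the first
    # VNET, every other VNET peered with it and linked to the private DNS zone.
    hub, spokes = vnets[0], vnets[1:]
    return {
        "private_endpoints": [hub],
        "peerings": [(hub, spoke) for spoke in spokes],
        "extra_dns_zone_links": spokes,
    }


workspace_vnets = ["vnet-ws1", "vnet-ws2", "vnet-ws3"]  # hypothetical names
print(plan_private_access(workspace_vnets))
print(plan_private_access(workspace_vnets, peer_vnets=True))
```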
Getting started with Delta Sharing across cloud regions
We discussed the network configuration options available to access the data stored in the ADLS Gen2 storage account using Delta Sharing. Depending on your security requirements and budget constraints, you can either use Service Endpoints, which have no additional charges and are easier to set up, or use Private Endpoints which incur additional costs but are more secure.
In the next blog in this series, we will dive into cross-region D2D Delta Sharing on AWS as well as cross-cloud data sharing between Databricks workspaces across multiple cloud providers.