Delta Sharing is an open protocol developed by Databricks for secure data sharing within an organization and externally, regardless of the computing platforms used. Depending on who you are sharing the data with, Delta Sharing can be used in two ways.
As there is usually a single UC metastore per cloud region for a given Databricks account, all workspaces in that particular cloud region will link to the same UC metastore and can access data seamlessly using the built-in governance capabilities.
In a multi-region or multi-cloud setup, there is a need to share data between regions/clouds, for which there are three options:
This article focuses on Databricks-to-Databricks (D2D) Delta Sharing on Azure Databricks.
Delta sharing across cloud regions works as follows:
As you might have already guessed, this only works in the following cases:
We will look more closely at case 3 above. Note that in the scenario we are interested in, D2D sharing, there are no public IP addresses associated with the Databricks workspaces that we could whitelist on the ADLS Gen2 firewall.
Because the Databricks data plane runs in VNETs, there are two ways to securely access storage accounts from a Databricks workspace.
The recipient-side Databricks data plane VNET subnets (public/host subnet) should be added to the network configuration of the provider-side ADLS Gen2 storage account. This can be done using a global (cross-region) service endpoint. Until April 2023, service endpoints allowed secure storage account access only from VNETs in the same cloud region. Cross-region service endpoints for Azure Storage became generally available in April 2023, so service endpoints can now be used across regions as well.
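To make the service endpoint option more concrete, here is a minimal sketch of the two pieces of configuration involved, written as plain Python dictionaries that mirror the ARM resource properties as we understand them. It only builds and prints the fragments; it does not call Azure. The subscription, resource group, VNET, and subnet names are hypothetical placeholders, and the `Microsoft.Storage.Global` service name is our assumption for the cross-region storage service endpoint, so verify it against the Azure documentation before use.

```python
import json
from typing import Sequence


def subnet_id(subscription: str, resource_group: str, vnet: str, subnet: str) -> str:
    """Build the ARM resource ID of a subnet."""
    return (
        f"/subscriptions/{subscription}/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Network/virtualNetworks/{vnet}/subnets/{subnet}"
    )


def subnet_service_endpoints() -> list:
    """Service endpoint entry for the recipient subnet.

    Assumption: "Microsoft.Storage.Global" is the cross-region variant.
    """
    return [{"service": "Microsoft.Storage.Global"}]


def storage_network_acls(allowed_subnet_ids: Sequence[str]) -> dict:
    """networkAcls fragment for the provider storage account:
    deny by default, allow only the listed subnets."""
    return {
        "defaultAction": "Deny",
        "virtualNetworkRules": [
            {"id": sid, "action": "Allow"} for sid in allowed_subnet_ids
        ],
    }


# Hypothetical recipient-side names, for illustration only.
recipient_subnet = subnet_id(
    "<recipient-subscription-id>",
    "rg-recipient",
    "vnet-databricks-recipient",
    "public-subnet",
)
acls = storage_network_acls([recipient_subnet])

print(json.dumps(subnet_service_endpoints(), indent=2))
print(json.dumps(acls, indent=2))
```

The first fragment belongs on the recipient subnet, the second on the provider storage account. How you apply them (portal, CLI, or infrastructure-as-code) is up to you.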
Azure Private Link is the most secure way to access Azure data services from Azure Databricks. Although Service Endpoints and Private Endpoints both route the traffic between your virtual network and the storage account over the Microsoft network backbone, with a Service Endpoint the storage account is still reached at a publicly routable IP address, whereas a Private Endpoint is a private IP in the address space of the virtual network where it is configured.
The private endpoint setup for allowing access to a firewall-enabled storage account across cloud regions is as follows:
https://learn.microsoft.com/en-us/azure/databricks/data-sharing/share-data-databricks
Note: Since the storage account firewall is on, the recipient fails to access the share at this point, because the recipient's Databricks workspace tries to fetch the files directly from the storage account.
3. Create a private endpoint in the (provider) storage account from the (recipient) Databricks workspace VNET in the other region.
The following configuration should be used for the setup:
4. Test that the recipient can access the data shared by the provider.
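Before querying the shared table, it can help to confirm that the storage account hostname resolves to the private endpoint from the recipient side. The following is a minimal sketch using only the Python standard library. The helper name `resolves_to_private_ip` and the example addresses are our own illustration. The resolver is injectable so the logic can be exercised without network access.

```python
import ipaddress
import socket
from typing import Callable


def resolves_to_private_ip(
    hostname: str,
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> bool:
    """Return True if the hostname resolves to a non-public address.

    Note: ipaddress treats loopback and link-local ranges as private too,
    so this is a coarse check, not a proof that a private endpoint is used.
    """
    address = ipaddress.ip_address(resolve(hostname))
    return address.is_private


# Offline demonstration with stubbed resolvers (addresses are made up).
print(resolves_to_private_ip("example", resolve=lambda _host: "10.1.0.5"))
print(resolves_to_private_ip("example", resolve=lambda _host: "20.60.12.34"))

# On a recipient cluster you would call it with the real hostname, e.g.:
# resolves_to_private_ip("<storage-account>.dfs.core.windows.net")
```

If the check returns `False` on a recipient cluster, the private DNS zone is probably not linked to the workspace VNET, and requests will still hit the public endpoint and be rejected by the firewall.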
If more than one Databricks workspace in isolated VNETs in the recipient region needs to access the same storage account, we can either create a separate private endpoint for each VNET or peer the VNETs and use a single private endpoint. In the peering case, we also need to add a virtual network link for each VNET to the private DNS zone that was created during the setup of the private endpoint.
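The trade-off between the two options can be sketched as a small planning helper that lists which resources each approach needs. This is purely illustrative: the function name, the VNET names, and the choice of the first VNET as the one hosting the shared endpoint are our assumptions, and nothing here talks to Azure.

```python
from typing import Sequence


def plan_private_access(recipient_vnets: Sequence[str], share_endpoint: bool) -> dict:
    """List the resources needed to give every recipient VNET private access.

    share_endpoint=False: one private endpoint per VNET.
    share_endpoint=True: one private endpoint in the first VNET, peerings
    to the others, and an extra private DNS zone link for each peered VNET.
    """
    vnets = list(recipient_vnets)
    if not vnets:
        raise ValueError("at least one recipient VNET is required")

    if not share_endpoint:
        return {
            "private_endpoints": vnets,
            "vnet_peerings": [],
            "extra_dns_zone_links": [],
        }

    hub, spokes = vnets[0], vnets[1:]
    return {
        "private_endpoints": [hub],
        "vnet_peerings": [(hub, spoke) for spoke in spokes],
        # The hub's own link is created with the private endpoint;
        # every peered VNET needs an additional link to the same zone.
        "extra_dns_zone_links": spokes,
    }


# Hypothetical VNET names for three recipient workspaces.
workspace_vnets = ["vnet-ws-a", "vnet-ws-b", "vnet-ws-c"]
print(plan_private_access(workspace_vnets, share_endpoint=False))
print(plan_private_access(workspace_vnets, share_endpoint=True))
```

With many workspaces, the shared option keeps the number of private endpoints at one, at the cost of managing peerings and DNS zone links.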
We discussed the network configuration options available for accessing data stored in an ADLS Gen2 storage account using Delta Sharing. Depending on your security requirements and budget constraints, you can either use Service Endpoints, which have no additional charges and are easier to set up, or use Private Endpoints, which incur additional costs but are more secure.
In the next blog in this series, we will dive into cross-region D2D Delta Sharing on AWS as well as cross-cloud data sharing between Databricks workspaces across multiple cloud providers.