For customers on the E2 Platform, Databricks has a feature that allows them to use AWS PrivateLink to provision secure private workspaces by creating VPC endpoints to both the front-end and back-end interfaces of the Databricks infrastructure. The front-end VPC endpoint ensures that users connect to the Databricks web application, REST APIs and JDBC/ODBC interface over their private network. The back-end VPC endpoints ensure that clusters deployed in a customer-managed VPC connect to the Databricks-managed secure cluster connectivity relay and REST APIs over a private endpoint.
We previously covered how customers can leverage AWS Route 53 Outbound Resolver endpoints to allow workspaces deployed in their own VPC to resolve custom hostnames hosted on customer-managed DNS servers. When using PrivateLink for the front-end, the workspace URL needs to resolve to the private IP of the PrivateLink interface in order to enable access to the workspace via private connectivity (from on-premises or other connected VPCs).
In this blog, we are going to show how to leverage Route 53 Inbound Endpoints to enable DNS name resolution of workspaces with PrivateLink enabled for the front-end interface.
Architecture
The following diagram shows how a client on the customer's on-premises network sends a request to the corporate DNS server, which has a forwarding rule configured for the privatelink.cloud.databricks.com domain. The DNS query is forwarded to the IP of the Inbound Resolver Endpoint in AWS, whose VPC is associated with the Private Hosted Zone. That zone contains a record <region>.privatelink.cloud.databricks.com pointing to the private IP of the front-end PrivateLink interface.
The key components of the architecture are:
On-premises corporate DNS server with a forwarding rule for the privatelink.cloud.databricks.com domain.
Private connectivity between the corporate data centre and the AWS VPC. This connectivity can be established, for example, using AWS Direct Connect or an IPSec VPN.
A Private Hosted Zone (PHZ) in Route 53 for the privatelink.cloud.databricks.com domain. For each region where Databricks is deployed, a record <region>.privatelink.cloud.databricks.com pointing to the private IP of the front-end endpoint of that region is required. Note that the region is not the AWS region name but rather the Databricks region name found in the address column of the table highlighted here, e.g., tokyo, seoul, mumbai, etc.
Route 53 Resolver Inbound endpoint to look for DNS records on the PHZ and provide the response back to the on-premises DNS Server.
Databricks workspaces with PrivateLink for the front-end interface (Web App and REST APIs). To achieve this, an endpoint needs to be created in AWS and registered in Databricks, and a Private Access Setting object needs to be associated with the workspace in the Databricks Account. The complete guide can be found here.
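As a rough illustration of the Route 53 Resolver Inbound endpoint component, the sketch below builds the arguments for the Route 53 Resolver CreateResolverEndpoint operation. The endpoint name, security group ID and subnet IDs are hypothetical placeholders, and the boto3 call itself is left as a comment because it needs AWS credentials. Treat it as a starting point under those assumptions, not as the definitive setup.

```python
import uuid


def inbound_endpoint_request(name, security_group_ids, subnet_ids):
    """Build keyword arguments for Route 53 Resolver's CreateResolverEndpoint."""
    # AWS asks for at least two IP addresses per endpoint for redundancy,
    # ideally in subnets that live in different Availability Zones.
    if len(subnet_ids) < 2:
        raise ValueError("provide at least two subnets for the inbound endpoint")
    return {
        "CreatorRequestId": str(uuid.uuid4()),  # idempotency token
        "Name": name,
        "Direction": "INBOUND",
        "SecurityGroupIds": list(security_group_ids),
        # Omitting "Ip" lets the Resolver pick a free address in each subnet.
        "IpAddresses": [{"SubnetId": subnet_id} for subnet_id in subnet_ids],
    }


# Hypothetical identifiers, for illustration only.
request = inbound_endpoint_request(
    "databricks-privatelink-inbound",
    ["sg-0123456789abcdef0"],
    ["subnet-0aaa1111bbbb2222c", "subnet-0ddd3333eeee4444f"],
)

# With boto3 installed and AWS credentials configured, the call would be:
#   boto3.client("route53resolver").create_resolver_endpoint(**request)
```

The security group attached to the endpoint has to allow DNS traffic (TCP and UDP port 53) from the on-premises DNS servers, and the IP addresses assigned to the endpoint are the targets of the corporate forwarding rule.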
DNS Records
Before providing more details on the DNS record that needs to be created, it is important to understand how the workspace URL is resolved by Databricks before and after the front-end PrivateLink is configured.
Let's look at an example:
Note how the workspace URL, by default, is resolved by Databricks to another host using a CNAME record (line 3). The host in the example above is sydney.cloud.databricks.com which is used by the Databricks Control Plane in the Sydney region (ap-southeast-2). The recursive DNS lookup continues until a public IP address for that host is returned — in this case, 3.26.4.13 (line 6).
When configuring a workspace for front-end PrivateLink, one of the required steps is to update the workspace in the Databricks Account and attach a Private Access Setting (PAS). The PAS specifies which private endpoints registered in the Databricks Account can be used to access that particular workspace. Once the workspace is updated and a PAS is attached, the DNS resolution for that workspace URL changes to the following:
Note how now the Workspace CNAME record resolves to another hostname: sydney.privatelink.cloud.databricks.com. This is useful because we don't need to override the public DNS zone cloud.databricks.com, which is not recommended since it can cause issues when resolving other hosts managed by Databricks on that public domain. Instead, we can simply override the privatelink.cloud.databricks.com domain which is only used for front-end PrivateLink.
Since our PHZ contains that record, the recursive DNS query will return the Private IP address of the interface registered on the PHZ (line 4), instead of the public IP as in the previous example. The table below describes the record required for your PHZ where the zone is privatelink.cloud.databricks.com:
+-------------+-------------------------------------------+------------------------------------------------------------------+
| Record type | Record name                               | Value                                                            |
+-------------+-------------------------------------------+------------------------------------------------------------------+
| A           | <region>.privatelink.cloud.databricks.com | IP address of the front-end PrivateLink endpoint that you create |
+-------------+-------------------------------------------+------------------------------------------------------------------+
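To make the record concrete, here is a minimal sketch that builds the change batch for the Route 53 ChangeResourceRecordSets operation. The region name sydney comes from the example in this post, while the IP address 10.0.12.34, the TTL of 300 seconds and the hosted zone ID placeholder are assumptions chosen for illustration.

```python
import ipaddress

ZONE = "privatelink.cloud.databricks.com"


def front_end_record_change(databricks_region, endpoint_ip, ttl=300):
    """Build a Route 53 change batch for the regional front-end A record."""
    ip = ipaddress.ip_address(endpoint_ip)
    # The record must point at the private IP of the VPC endpoint interface.
    if not ip.is_private:
        raise ValueError(f"{endpoint_ip} is not a private IP address")
    return {
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": f"{databricks_region}.{ZONE}",
                    "Type": "A",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": str(ip)}],
                },
            }
        ]
    }


# Hypothetical endpoint IP, for illustration only.
change_batch = front_end_record_change("sydney", "10.0.12.34")

# With boto3 installed and AWS credentials configured, the call would be:
#   boto3.client("route53").change_resource_record_sets(
#       HostedZoneId="<your PHZ id>", ChangeBatch=change_batch)
```

Rejecting public addresses up front guards against pasting the public control plane IP into the PHZ by mistake.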
Another consideration is that although each workspace has its own URL, the CNAME target is shared for the entire region (e.g., sydney.privatelink.cloud.databricks.com). This means that all workspaces with front-end PrivateLink configured in that region will be accessible over the same PrivateLink interface.
Important: once a PAS is attached to a workspace it can’t be removed (only replaced by another PAS) so this configuration is irreversible.
1. Mixed deployments where certain workspaces have PrivateLink and others don't
In certain circumstances, you may need a mix where one or more workspaces use front-end PrivateLink and others don't, for example, to allow users or applications outside the corporate network to connect to selected Databricks workspaces.
Using the architecture above, no further action is required to enable this configuration. When users connect to a workspace without front-end PrivateLink, the workspace URL resolves to the public IP of the Databricks Control Plane and traffic is routed via the internet to that endpoint. Only workspaces with a PAS attached have their DNS resolution updated and resolve to the private IP address of the Databricks Workspace PrivateLink endpoint.
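The behaviour can be illustrated with a small, simplified simulation of the lookup. The workspace hostnames, the private IP 10.0.12.34 and the public fallback for the privatelink name are assumptions made for this sketch, and the real CNAME chain has more hops. Only the Sydney hostnames and the public IP 3.26.4.13 are taken from the example in this post.

```python
# Simplified public DNS view (hypothetical workspace names).
PUBLIC_ZONE = {
    # Workspace without a PAS: points at the regional public host.
    "workspace-a.cloud.databricks.com": ("CNAME", "sydney.cloud.databricks.com"),
    # Workspace with a PAS attached: points at the privatelink host.
    "workspace-b.cloud.databricks.com": ("CNAME", "sydney.privatelink.cloud.databricks.com"),
    # Assumption: outside the corporate network the privatelink name
    # still ends up at the public endpoint.
    "sydney.privatelink.cloud.databricks.com": ("CNAME", "sydney.cloud.databricks.com"),
    "sydney.cloud.databricks.com": ("A", "3.26.4.13"),
}

# Private Hosted Zone seen by clients that use the corporate DNS server.
PRIVATE_HOSTED_ZONE = {
    "sydney.privatelink.cloud.databricks.com": ("A", "10.0.12.34"),
}


def resolve(name, private_zone=None, max_hops=8):
    """Follow CNAME records until an A record is found.

    Records in private_zone take precedence over the public zone, which is
    the effect of the forwarding rule plus the PHZ.
    """
    private_zone = private_zone or {}
    for _ in range(max_hops):
        record_type, value = private_zone.get(name) or PUBLIC_ZONE[name]
        if record_type == "A":
            return value
        name = value  # follow the CNAME
    raise RuntimeError("CNAME chain too long")


# Corporate client, workspace without PAS: public IP.
print(resolve("workspace-a.cloud.databricks.com", PRIVATE_HOSTED_ZONE))  # 3.26.4.13
# Corporate client, workspace with PAS: private endpoint IP.
print(resolve("workspace-b.cloud.databricks.com", PRIVATE_HOSTED_ZONE))  # 10.0.12.34
# Client outside the corporate network, workspace with PAS: public IP.
print(resolve("workspace-b.cloud.databricks.com"))  # 3.26.4.13
```

Because the override lives only in the privatelink.cloud.databricks.com zone, names under cloud.databricks.com that do not pass through the privatelink host are unaffected.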
2. Hybrid deployments where certain clients connect to a Databricks Workspace over Public Endpoint and others via Private Endpoint
There is another scenario where you may want to enforce that certain clients only connect to a Databricks workspace via Private Endpoints while other approved clients will connect via the Public Endpoint. For example, you can enforce that all Databricks users in the corporate network only connect via Private Endpoints but allow certain SaaS applications to connect to Databricks via the Public Endpoint because they don't support private connectivity to your network.
To allow this hybrid setup, when creating the Private Access Settings, you need to make sure that the "Public access" setting is set to "True". This means that Databricks allows access to that workspace from the Private and Public Endpoints.
Once again, no additional changes are required on the DNS, PHZ and Resolver Endpoints. Clients on the corporate network will continue to be routed to the Private IP address of the PrivateLink endpoint while clients outside the network will resolve the workspace URL to the public endpoint.
To restrict which clients can connect to Databricks using the public endpoint, you can leverage the IP Access Lists feature to allow only specific source public IP addresses. These addresses are typically provided by external applications such as third-party SaaS. More details about IP Access List configuration can be found here.
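Before submitting addresses to an allow list, it can help to check them locally. The sketch below only illustrates the matching logic with Python's ipaddress module. It is not how Databricks implements IP Access Lists, and the addresses are reserved documentation ranges used as placeholders.

```python
import ipaddress


def is_allowed(source_ip, allow_list):
    """Return True if source_ip matches any address or CIDR in allow_list."""
    ip = ipaddress.ip_address(source_ip)
    # A bare address such as "198.51.100.7" is treated as a /32 network.
    networks = [ipaddress.ip_network(entry, strict=False) for entry in allow_list]
    return any(ip in network for network in networks)


# Hypothetical egress ranges published by a SaaS vendor.
ALLOW_LIST = ["203.0.113.0/24", "198.51.100.7"]

print(is_allowed("203.0.113.45", ALLOW_LIST))  # True
print(is_allowed("198.51.100.7", ALLOW_LIST))  # True
print(is_allowed("192.0.2.10", ALLOW_LIST))    # False
```

Passing an invalid address or CIDR raises ValueError, which surfaces typos before the list is applied to a workspace.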