Hi there,
I'm helping a client of mine set up an Azure Databricks environment. The workspace is set up for private access only, and we are using Azure Firewall and Azure Private Link.
We have the network environment successfully configured to the point where we are able to start clusters in the workspace. However, when trying to run a simple SQL statement from within a notebook, I'm getting a very strange error:
CREATE CATALOG IF NOT EXISTS quickstart_catalog
com.databricks.common.client.UnexpectedHttpError: HTTP request failed with status: HTTP/1.1 302 Found
Looking at the Azure Firewall application rule logs, I see outbound traffic from the cluster to canadacentral.azuredatabricks.net. I suspect the 302 is a redirect to the Databricks login page following a failed authentication attempt (the same thing that happens when you try to connect to a private workspace from outside the network). This is reinforced by what I see when I run USE CATALOG: the payload that comes back is the HTML of the Databricks login page, not JSON:
USE CATALOG main
Py4JJavaError: An error occurred while calling o336.sql.
: shaded.v245.com.fasterxml.jackson.core.JsonParseException: Unexpected character ('<' (code 60)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
at [Source: <!doctype html><html><head><meta charset="utf-8"/><meta http-equiv="Content-Language" content="en"/><title>Databricks - Sign In</title><meta name="viewport" content="width=960"/><link rel="icon" type="image/png" href="/favicon.ico"/><meta http-equiv="content-type" content="text/html; charset=UTF8"/><link rel="icon" href="/favicon.ico"><script defer="defer" src="/static/js/login/login.fb760649.js"></script></head><body class="light-mode"><uses-legacy-bootstrap><div id="login-page"></div></uses-legacy-bootstrap></body></html>; line: 1, column: 2]
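For anyone who wants to reproduce the check, here is a minimal Python sketch of how I'd confirm that a response body is a login page rather than the JSON the client expects. The function name classify_response_body and the sample strings are my own illustrations, not anything from Databricks:

```python
import json


def classify_response_body(body: str) -> str:
    """Return 'json', 'html', or 'other' for an HTTP response body.

    Hypothetical helper: an API client expects JSON, so an HTML body
    usually means the request was redirected to a sign-in page.
    """
    text = body.lstrip()
    try:
        json.loads(text)
        return "json"
    except ValueError:
        # json.JSONDecodeError is a subclass of ValueError
        pass
    if text.lower().startswith(("<!doctype html", "<html")):
        return "html"
    return "other"


# Shortened version of the payload from the stack trace above
sample = "<!doctype html><html><head><title>Databricks - Sign In</title></head></html>"
print(classify_response_body(sample))  # html
```

An 'html' result here matches the JsonParseException above: the parser fails on the very first character ('<') because it was handed a web page.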
I suspect that something is incorrect with the Azure Firewall/Private Link setup, but I'm not entirely sure what. Quick summary:
- Simplified deployment (multiple subnets in one VNet)
- Azure Databricks (ADB) workspace in its own subnet, with an NSG allowing relevant traffic outbound and a UDR directing 0.0.0.0/0 to Azure Firewall (AFW)
- Two Private Link (PL) endpoints (api_ui and sso_auth) in the PL subnet, mapped to a private Azure DNS zone
- VNet has custom DNS settings (using two domain controllers running in Azure) - the DCs have a conditional forwarder for azuredatabricks.net to Azure DNS
- AFW has network rules in place allowing 1433, 3306, 9093, 6666 (TCP) and 123 (UDP) outbound from the workspace to all destinations
- AFW has application rule allowing outbound on 443 from workspace to *.azuredatabricks.net
- AFW has SNAT enabled for outbound traffic destined for the PL subnet (not sure if this is a best practice, but we couldn't start clusters yesterday until we enabled it)
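One check I can run from a notebook on the cluster is whether the Databricks hostnames resolve to private IPs (the Private Link endpoints) or to public ones. The sketch below is only a rough diagnostic under my own assumptions: the helper names are made up, and it says nothing about how the firewall actually routes the traffic.

```python
import ipaddress
import socket


def is_private_ip(ip: str) -> bool:
    """True if the address is in a private (non-internet-routable) range."""
    return ipaddress.ip_address(ip).is_private


def resolve_ips(hostname: str, port: int = 443) -> list:
    """Resolve a hostname using the DNS settings of the machine running this."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP
    return sorted({info[4][0] for info in infos})


def check_host(hostname: str) -> dict:
    """Map each resolved IP to whether it is private."""
    return {ip: is_private_ip(ip) for ip in resolve_ips(hostname)}


# Example call (needs DNS access, so it is left commented out):
# print(check_host("canadacentral.azuredatabricks.net"))
```

If a hostname that should go through a private endpoint comes back with a public IP, the DNS zone or the conditional forwarder would be the first place I'd look.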
I know the endpoints are working to some extent because I am able to both log into the workspace and start clusters (meaning the secure cluster connectivity relay is being established). However, I'm not sure why attempts to run SQL from a notebook are returning what look like authentication redirects to the Databricks login page. I suspect I am missing something in the Private Link/Azure Firewall setup. Any help would be much appreciated!
EDIT TO ADD:
When I use a catalog that is not backed by Unity Catalog (UC), such as the default hive_metastore, I don't get any errors. The errors only occur when running SQL against a UC-backed catalog.