Administration & Architecture

Compute cluster in Azure workspace is unable to access Unity Catalog volume on storage account

mzs
Contributor

Hi,

I'm setting up a workspace in Azure with VNet injection. I'm able to upload files through the web UI to a Unity Catalog managed volume backed by an Azure storage account, and to access them from notebooks on serverless compute, for example with `dbutils.fs.ls("/Volumes/mycatalog/myschema/myvolume")`.

The same `dbutils.fs.ls()` call fails from a classic all-purpose compute cluster with `com.microsoft.azure.storage.StorageException: This request is not authorized to perform this operation`, and `spark.read.csv(...)` on a path in the volume fails the same way.
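
For reference, here is the minimal repro as it runs in a notebook (a sketch: `spark`, `dbutils`, and `display` come from the notebook environment, and `some_file.csv` is a placeholder name, not an actual file in this setup):

```python
# Both calls succeed on serverless compute but fail on the classic
# all-purpose cluster with the StorageException quoted below.
files = dbutils.fs.ls("/Volumes/mycatalog/myschema/myvolume")
display(files)

df = spark.read.csv(
    "/Volumes/mycatalog/myschema/myvolume/some_file.csv",  # placeholder file name
    header=True,
)
df.show(5)
```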

Some points about the setup:

  • Azure storage account with HNS enabled
    • shared key access disabled (have tried enabling, didn't help)
    • default-deny network rules (have tried opening up, didn't help)
      • the workspace's public and private subnets are allowed, as are the NCC subnets
      • public and private subnets have the Microsoft.Storage.Global service endpoint (and others recommended in the docs: Microsoft.Sql, Microsoft.KeyVault, Microsoft.EventHub)
      • "Allow trusted Microsoft services to access this account" is enabled
  • Unity Catalog with metastore. I am not using the default metastore storage location (there is none)
  • catalog storage root is set to `abfss://mycontainer@myaccount.dfs.core.windows.net`
  • access connector (azurerm_databricks_access_connector in Terraform) with system-assigned managed identity
  • Databricks storage credential, ISOLATION_MODE_ISOLATED, using the access connector/managed identity (see the SDK sketch after this list)
  • The managed identity has the Storage Blob Data Contributor role on the storage account
  • secure cluster connectivity
  • private endpoint for databricks_ui_api (Azure private link simplified deployment)
  • workspace has public network access enabled (hybrid public/private)
  • all-purpose cluster is 16.4 LTS, SHARED (aka Standard)
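
To sanity-check the Unity Catalog side of this setup from a notebook, here is a small sketch using the Databricks Python SDK (assuming `databricks-sdk` is available on the cluster; names like `mycatalog` are the placeholders from above):

```python
# Sketch: list storage credentials and the catalog root to confirm which
# access connector / managed identity Unity Catalog is actually using.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # uses notebook auth when run inside the workspace

for cred in w.storage_credentials.list():
    # azure_managed_identity carries the access connector resource ID
    print(cred.name, cred.isolation_mode, cred.azure_managed_identity)

cat = w.catalogs.get("mycatalog")
print(cat.name, cat.storage_root)
```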

Things I've tried that have not helped:

  • allowed all public access to the storage account
  • enabled shared key authentication on the storage account
  • removed the Microsoft.Storage.Global service endpoints from the private and public subnets
  • found an additional `dbmanagedidentity` managed identity in the managed resource group and tried assigning it the Storage Blob Data Contributor role on the storage account
  • enabled all diagnostic logs on the storage account - I see nothing in them, not even failed StorageRead attempts

The next thing I'll try is enabling a private endpoint on the storage account, but I'd rather not, because it seems like this should work with service endpoints (ref) and that would avoid needless bandwidth charges. Has anyone run across this before?

Below is more of the exception:

com.databricks.rpc.UnknownRemoteException: Remote exception occurred:
	com.databricks.backend.daemon.data.server.FailedOperationAttemptException: Metadata operation failed
		at com.databricks.backend.daemon.data.server.DefaultMetadataManager.doReadFile$1(MetadataManager.scala:735)
		at com.databricks.backend.daemon.data.server.DefaultMetadataManager.$anonfun$readMountFile$8(MetadataManager.scala:791)
		at com.databricks.backend.daemon.data.server.DefaultMetadataManager.withRetries(MetadataManager.scala:1032)
		at com.databricks.backend.daemon.data.server.DefaultMetadataManager.$anonfun$readMountFile$6(MetadataManager.scala:791)
		at scala.util.Try$.apply(Try.scala:213)
		at com.databricks.backend.daemon.data.server.DefaultMetadataManager.readMountFile(MetadataManager.scala:791)
		at com.databricks.backend.daemon.data.server.DefaultMetadataManager.getMountFileState(MetadataManager.scala:631)
		at com.databricks.backend.daemon.data.server.DefaultMetadataManager.getMounts(MetadataManager.scala:835)
		at com.databricks.backend.daemon.data.server.handler.MountsGetHandler.receive(MountsGetHandler.scala:31)
		at com.databricks.backend.daemon.data.server.handler.MountHandler.receive(MountHandler.scala:104)
		at com.databricks.backend.daemon.data.server.handler.DbfsRequestHandler.receive(DbfsRequestHandler.scala:16)
		at com.databricks.backend.daemon.data.server.handler.DbfsRequestHandler.receive$(DbfsRequestHandler.scala:15)
		at com.databricks.backend.daemon.data.server.handler.MountHandler.receive(MountHandler.scala:39)
		at com.databricks.backend.daemon.data.server.session.SessionContext.$anonfun$queryHandlers$1(SessionContext.scala:51)
	Caused by: shaded.databricks.org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: This request is not authorized to perform this operation.
		at shaded.databricks.org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.retrieveMetadata(AzureNativeFileSystemStore.java:2643)
		at shaded.databricks.org.apache.hadoop.fs.azure.NativeAzureFileSystem.open(NativeAzureFileSystem.java:3037)
		at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:997)
		at com.databricks.backend.daemon.data.server.DefaultMetadataManager.doReadFile$1(MetadataManager.scala:680)
		... 122 more
	Caused by: com.microsoft.azure.storage.StorageException: This request is not authorized to perform this operation.
		at com.microsoft.azure.storage.StorageException.translateException(StorageException.java:87)
		at com.microsoft.azure.storage.core.StorageRequest.materializeException(StorageRequest.java:305)
		at com.microsoft.azure.storage.core.ExecutionEngine.executeWithRetry(ExecutionEngine.java:196)
		at com.microsoft.azure.storage.blob.CloudBlob.downloadAttributes(CloudBlob.java:1414)
		at shaded.databricks.org.apache.hadoop.fs.azure.StorageInterfaceImpl$CloudBlobWrapperImpl.downloadAttributes(StorageInterfaceImpl.java:377)
		at shaded.databricks.org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.retrieveMetadata(AzureNativeFileSystemStore.java:2582)
		... 125 more
Accepted Solution

mzs
Contributor

The problem was actually with DBFS and the internal Databricks-managed storage account firewall, not even with the storage account my catalog is using. The cluster event logs would occasionally show "DBFS is down".

In Terraform, in my azurerm_databricks_workspace resource, I had set default_storage_firewall_enabled = true. This puts a firewall on the internal storage account and adds the NCC subnets to its allow list, but not the classic compute subnets. To make that combination work I would need to set up private endpoints for the internal storage account: https://learn.microsoft.com/en-us/azure/databricks/security/network/storage/firewall-support

Since we don't have anything using DBFS explicitly, I turned off DBFS in workspace security settings ("Disable DBFS root and mounts"), and now I'm able to work with files and tables from Unity Catalog in a notebook. The "DBFS is down" messages are gone from the event log as well.
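
As a quick check after flipping that setting, both file and table access from the classic cluster work again (a sketch; `some_table` is a placeholder name):

```python
# Verify Unity Catalog access from the classic cluster after disabling
# DBFS root and mounts. "some_table" is a placeholder table name.
display(dbutils.fs.ls("/Volumes/mycatalog/myschema/myvolume"))
spark.table("mycatalog.myschema.some_table").show(5)
```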


6 REPLIES

rswarnkar5
New Contributor II

What private endpoints do you have on your storage account? Check whether a private endpoint for the DFS endpoint exists.

Hi, I don't have any private endpoints.

I was using service endpoints, but at this point I've removed them and opened the storage account's network restrictions to all public networks, and it still hits the same error.
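
One way to rule out basic network reachability from the classic cluster is a generic probe run in a notebook on that cluster (a sketch; the hostnames are the placeholder account from above):

```python
import socket

# Probe TCP 443 to the storage endpoints from the driver. This only checks
# reachability; it says nothing about authorization, which is what the
# "not authorized" StorageException is about.
for host in ("myaccount.dfs.core.windows.net", "myaccount.blob.core.windows.net"):
    try:
        with socket.create_connection((host, 443), timeout=5):
            print(f"reachable: {host}")
    except OSError as exc:
        print(f"not reachable: {host}: {exc}")
```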

rswarnkar5
New Contributor II

Hi, I think you are trying a lot of things. Try to isolate the RBAC access issue from the network issue. How about you first try the following:

1. Keep all resources (storage account, Databricks) public in a sandbox environment and check whether things work, keeping the roles constant. If that works, then go fully private.

2. Change a single aspect between each attempt, e.g. networking or RBAC. Stick to either private endpoints or service endpoints, as mixing them might not be desirable.

3. See if you can produce diagnostic information about the error message on the Databricks side.

Is there a way to get some diagnostic information from the underlying libraries? Maybe an environment variable or setting I can apply at the cluster level that would make more detail show up in the Spark driver or worker logs?
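
One generic way to get more detail is to raise the Spark log level from a notebook before re-running the failing call (a sketch; whether the shaded storage client actually logs more at DEBUG is an open question):

```python
# Raise driver log verbosity, reproduce the failure, then inspect the driver
# logs in the cluster UI. Reset to WARN afterwards to limit noise.
spark.sparkContext.setLogLevel("DEBUG")
try:
    dbutils.fs.ls("/Volumes/mycatalog/myschema/myvolume")  # the failing call
finally:
    spark.sparkContext.setLogLevel("WARN")
```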

I think that stack trace is from the old Azure storage library (com.microsoft.azure.storage). I'd like to know what endpoint it's calling, and where it's getting a token from (and what kind).

The external location has an associated storage credential that uses a system-assigned managed identity. How does a token from that managed identity get to the classic compute cluster? Typically the Azure storage SDK would connect to the IMDS endpoint and get an access token based on the virtual machine's own managed identity, but here the managed identity is tied to the storage credential object, not the VM. That doesn't seem to be a problem for serverless compute, so maybe something in the control plane gets a token and passes it through to the cluster.
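
To see what the VM itself can do, one could query the Azure Instance Metadata Service directly from the driver (a generic Azure IMDS call, not a Databricks API; it only shows whether the node's VM exposes a managed identity, not how Unity Catalog brokers credentials):

```python
import requests

# Ask IMDS for a managed-identity token scoped to Azure Storage. Success means
# the VM itself has a usable managed identity; an error suggests the storage
# token must come from somewhere else, e.g. brokered by the control plane.
resp = requests.get(
    "http://169.254.169.254/metadata/identity/oauth2/token",
    params={"api-version": "2018-02-01", "resource": "https://storage.azure.com/"},
    headers={"Metadata": "true"},
    timeout=5,
)
print(resp.status_code, "access_token" in resp.text)
```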


SebastianRowan
Contributor

Could the classic cluster still be using the old WASB driver (the `NativeAzureFileSystem` in the stack trace) instead of ABFS with the managed identity?