2 weeks ago - last edited 2 weeks ago
Hi,
I'm setting up a workspace in Azure with VNet injection. I can upload files through the web UI to a Unity Catalog volume on a managed storage account, and access them from notebooks on serverless compute, for example with `dbutils.fs.ls("/Volumes/mycatalog/myschema/myvolume")`.
The same `dbutils.fs.ls()` call fails from a classic all-purpose compute cluster with `com.microsoft.azure.storage.StorageException: This request is not authorized to perform this operation`. I get the same error from `spark.read.csv(...)` with a path on the volume.
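For concreteness, a minimal sketch of the calls involved (the volume path is the one above; the CSV file name is just a placeholder):

```
# Works on serverless compute; fails on the classic all-purpose cluster with
# "This request is not authorized to perform this operation".
for f in dbutils.fs.ls("/Volumes/mycatalog/myschema/myvolume"):
    print(f.path)

# Same error when reading from the volume with Spark.
# "example.csv" is a placeholder file name, not one from my setup.
df = spark.read.csv(
    "/Volumes/mycatalog/myschema/myvolume/example.csv",
    header=True,
)
df.show()
```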
Some points about the setup:
Things I've tried that have not helped:
The next thing I'll try is enabling a private endpoint on the storage account, but I'd rather not: it seems like this should work with service endpoints (ref), and that would avoid needless bandwidth charges. Has anyone run across this before?
Below is more of the exception:
com.databricks.rpc.UnknownRemoteException: Remote exception occurred:
com.databricks.backend.daemon.data.server.FailedOperationAttemptException: Metadata operation failed
at com.databricks.backend.daemon.data.server.DefaultMetadataManager.doReadFile$1(MetadataManager.scala:735)
at com.databricks.backend.daemon.data.server.DefaultMetadataManager.$anonfun$readMountFile$8(MetadataManager.scala:791)
at com.databricks.backend.daemon.data.server.DefaultMetadataManager.withRetries(MetadataManager.scala:1032)
at com.databricks.backend.daemon.data.server.DefaultMetadataManager.$anonfun$readMountFile$6(MetadataManager.scala:791)
at scala.util.Try$.apply(Try.scala:213)
at com.databricks.backend.daemon.data.server.DefaultMetadataManager.readMountFile(MetadataManager.scala:791)
at com.databricks.backend.daemon.data.server.DefaultMetadataManager.getMountFileState(MetadataManager.scala:631)
at com.databricks.backend.daemon.data.server.DefaultMetadataManager.getMounts(MetadataManager.scala:835)
at com.databricks.backend.daemon.data.server.handler.MountsGetHandler.receive(MountsGetHandler.scala:31)
at com.databricks.backend.daemon.data.server.handler.MountHandler.receive(MountHandler.scala:104)
at com.databricks.backend.daemon.data.server.handler.DbfsRequestHandler.receive(DbfsRequestHandler.scala:16)
at com.databricks.backend.daemon.data.server.handler.DbfsRequestHandler.receive$(DbfsRequestHandler.scala:15)
at com.databricks.backend.daemon.data.server.handler.MountHandler.receive(MountHandler.scala:39)
at com.databricks.backend.daemon.data.server.session.SessionContext.$anonfun$queryHandlers$1(SessionContext.scala:51)
Caused by: shaded.databricks.org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: This request is not authorized to perform this operation.
at shaded.databricks.org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.retrieveMetadata(AzureNativeFileSystemStore.java:2643)
at shaded.databricks.org.apache.hadoop.fs.azure.NativeAzureFileSystem.open(NativeAzureFileSystem.java:3037)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:997)
at com.databricks.backend.daemon.data.server.DefaultMetadataManager.doReadFile$1(MetadataManager.scala:680)
... 122 more
Caused by: com.microsoft.azure.storage.StorageException: This request is not authorized to perform this operation.
at com.microsoft.azure.storage.StorageException.translateException(StorageException.java:87)
at com.microsoft.azure.storage.core.StorageRequest.materializeException(StorageRequest.java:305)
at com.microsoft.azure.storage.core.ExecutionEngine.executeWithRetry(ExecutionEngine.java:196)
at com.microsoft.azure.storage.blob.CloudBlob.downloadAttributes(CloudBlob.java:1414)
at shaded.databricks.org.apache.hadoop.fs.azure.StorageInterfaceImpl$CloudBlobWrapperImpl.downloadAttributes(StorageInterfaceImpl.java:377)
at shaded.databricks.org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.retrieveMetadata(AzureNativeFileSystemStore.java:2582)
... 125 more
2 weeks ago
Which private endpoints do you have on your storage account? Check whether the private endpoint for the DFS sub-resource exists.
2 weeks ago
Hi, I don't have any private endpoints.
I was using service endpoints but at this point I've removed the service endpoints and opened up the storage network restrictions to all public networks, and it still hits the same error.
2 weeks ago
Hi, I think you are trying a lot of things at once. Try to isolate the RBAC/access issue from the network issue.
How about you first try:
1. Keep all resources (storage account, Databricks) public in a sandbox environment and check whether things work. Keep the role assignments constant.
2. Change a single aspect between attempts, e.g. networking or RBAC. Stick to either private endpoints or service endpoints; mixing them might not be desirable.
3. See if you can produce diagnostic information for the error message on the Databricks side.
4. If that works, then go fully private.
2 weeks ago
Is there a way to get some diagnostic information from the underlying libraries? Maybe an environment variable I can pass, or something I can set at the cluster level, that would show up in the Spark driver or worker logs?
I think that stack trace is from the old Azure storage library (com.microsoft.azure.storage). I'd like to know which endpoint it's calling and where it's getting a token from (and what kind).
The external location has an associated storage credential backed by a managed identity. How does a token from that identity get provided to the classic compute cluster? Typically the Azure storage SDK would connect to the IMDS endpoint and get an access token based on the virtual machine's managed identity, but here the managed identity is tied to the storage credential object, not the VM. That doesn't seem to be a problem for serverless compute, so maybe there's something in the control plane that obtains a token and passes it through to the cluster.
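One thing I plan to try for more detail is raising the driver log level from a notebook; a rough sketch, assuming the shaded storage client logs through log4j (which may not hold):

```
# Raise the Spark driver log level so more of the storage client's activity
# shows up in the driver logs. Assumption: the shaded
# com.microsoft.azure.storage classes log via log4j, which is not guaranteed.
spark.sparkContext.setLogLevel("DEBUG")

# Re-run the failing call so the request is logged at the new level.
dbutils.fs.ls("/Volumes/mycatalog/myschema/myvolume")

# Restore a quieter level afterwards to keep the logs manageable.
spark.sparkContext.setLogLevel("WARN")
```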
a week ago
The problem was actually with DBFS and the internal Databricks-managed storage account firewall, not even with the storage account my catalog is using. The cluster event logs would occasionally show "DBFS is down".
In Terraform, in my azurerm_databricks_workspace resource, I had set default_storage_firewall_enabled = true. This puts a firewall on the internal storage account and adds the serverless subnets from the network connectivity configuration (NCC) to its allow list, but not the classic compute subnets. To make classic compute work with that firewall in place, I would need to set up private endpoints for the internal storage account: https://learn.microsoft.com/en-us/azure/databricks/security/network/storage/firewall-support
Since we don't have anything using DBFS explicitly, I turned off DBFS in the workspace security settings ("Disable DBFS root and mounts"), and now I'm able to work with files and tables from Unity Catalog in a notebook. The "DBFS is down" messages are gone from the event log as well.
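As a quick sanity check after flipping that setting (the table name below is a placeholder, not something from my workspace):

```
# Volume access from the classic all-purpose cluster now works.
for f in dbutils.fs.ls("/Volumes/mycatalog/myschema/myvolume"):
    print(f.path, f.size)

# Unity Catalog tables are reachable too; "mytable" is a placeholder name.
spark.read.table("mycatalog.myschema.mytable").limit(5).show()
```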
a week ago
Could the classic cluster still be going through the old WASB driver (the stack trace shows com.microsoft.azure.storage and NativeAzureFileSystem) instead of ABFS with the managed identity?