
How to Assign User Managed Identity to DBR Cluster so I can use it for querying ADLS Gen2?

Filip
New Contributor II
Hi,

I'm trying to figure out whether we can switch from Entra ID SPNs to User Assigned Managed Identities. Everything works except that I can't figure out how to access the lake files from a Python notebook.

I tried the code below, running it on a cluster assigned to the Managed Identity, but I got essentially the same error as when I ran it from any other cluster:



spark.conf.set("fs.azure.account.auth.type.storageaccount.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.storageaccount.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ManagedIdentityTokenProvider")

df = spark.read.csv("abfss://mylakecontainer@storageaccount.dfs.core.windows.net/path/to/csv/file.csv")
df.show()
but I get an error:
IllegalArgumentException: Failed to initialize org.apache.hadoop.fs.azurebfs.oauth2.ManagedIdentityTokenProvider

I could not find any example online where a Managed Identity is used to get access to the lake.

Then I decided to try a different approach. Since I had the Reader role assigned to that Managed Identity, it should at least be possible to, for example, print some storage account properties:



from azure.identity import ManagedIdentityCredential
from azure.mgmt.storage import StorageManagementClient

# Placeholder values
subscription_id = "111-111-111-111"
resource_group_name = "my-rg"
storage_account_name = "mystorage"

# Authenticate with the cluster's managed identity and read the storage
# account's properties via the ARM management API
credential = ManagedIdentityCredential()
storage_client = StorageManagementClient(credential, subscription_id)
storage_account = storage_client.storage_accounts.get_properties(
    resource_group_name, storage_account_name
)

print("Storage Account Properties:")
print(f"Name: {storage_account.name}")
print(f"Location: {storage_account.location}")
print(f"Kind: {storage_account.kind}")
print(f"SKU: {storage_account.sku.name}")
print(f"Primary Location: {storage_account.primary_location}")
print(f"Status of Primary: {storage_account.status_of_primary}")


And even with my cluster's Access Mode set to "Assigned", with the Managed Identity as the assignee, when I run the above code I get this error:

HttpResponseError: (AuthorizationFailed) The client '1f945563-4de8-44a0-a979-2c4e4540ad4c' with object id '1f945563-4de8-44a0-a979-2c4e4540ad4c' does not have authorization to perform action 'Microsoft.Storage/storageAccounts/read' over scope '/subscriptions/111-111-11/resourceGroups/my-rg/providers/Microsoft.Storage/storageAccounts/mystorage' or the scope is invalid. If access was recently granted, please refresh your credentials.


This client ID 1f945563-4de8-44a0-a979-2c4e4540ad4c belongs to dbmanagedidentity (the default enterprise application identity), not to my User Assigned Managed Identity, which is added to the workspace as an SPN. Why is the job running on that cluster in assigned mode still not using my managed identity?
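
One way to check which identity the node actually exposes is to query the Azure Instance Metadata Service directly. A diagnostic sketch (the commented-out client_id is a placeholder for the UMI's client ID):

# Diagnostic sketch: ask the standard Azure Instance Metadata Service (IMDS)
# for a storage token; the client_id in the response shows which managed
# identity actually answered the request.
import requests

resp = requests.get(
    "http://169.254.169.254/metadata/identity/oauth2/token",
    params={
        "api-version": "2018-02-01",
        "resource": "https://storage.azure.com/",
        # "client_id": "my-umi-client-id",  # placeholder: target a specific UMI
    },
    headers={"Metadata": "true"},
    timeout=10,
)
body = resp.json() if resp.ok else {}
print(resp.status_code, body.get("client_id", resp.text[:200]))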

As a last resort, I decided to use:



from azure.identity import DefaultAzureCredential

client_id = "my-umi-client-id"  # placeholder: the UMI's client ID
credential = DefaultAzureCredential(managed_identity_client_id=client_id)
But then I get the error:
ClientAuthenticationError: DefaultAzureCredential failed to retrieve a token from the included credentials.


WARNING:azure.identity._credentials.chained:DefaultAzureCredential failed to retrieve a token from the included credentials.
Attempted credentials:
EnvironmentCredential: EnvironmentCredential authentication unavailable. Environment variables are not fully configured.
ManagedIdentityCredential: ManagedIdentityCredential authentication unavailable. The requested identity has not been assigned to this resource. Error: Unexpected response "{'error': 'invalid_request', 'error_description': 'Identity not found'}"
SharedTokenCacheCredential: SharedTokenCacheCredential authentication unavailable. No accounts were found in the cache.
AzureCliCredential: Azure CLI not found on path
AzurePowerShellCredential: PowerShell is not installed
AzureDeveloperCliCredential: Azure Developer CLI could not be found. Please visit https://aka.ms/azure-dev for installation instructions and then, once installed, authenticate to your Azure account using 'azd auth login'.
To mitigate this issue, please refer to the troubleshooting guidelines here at https://aka.ms/azsdk/python/identity/defaultazurecredential/troubleshoot.
But this Managed Identity is assigned to that cluster...
What am I doing wrong here?
The Azure Databricks service does not have an Identity option, and if I go to the managed resource group of the workspace, there is just a plethora of VMs, so I guess this is not the correct way to assign a User Managed Identity to the cluster...
4 REPLIES

szymon_dybczak
Contributor III

Hi @Filip ,

That's an obsolete way of configuring access to a storage account. Nowadays you should use Unity Catalog (UC), with storage credentials and external locations, to configure access to a storage account.

A storage credential is a securable object representing an Azure managed identity or a Microsoft Entra ID service principal. Once a storage credential is created, access to it can be granted to principals (users and groups). Storage credentials are primarily used to create external locations, which scope access to a specific storage path.



Storage credentials - Azure Databricks - Databricks SQL | Microsoft Learn
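
For example, once a metastore admin has created a storage credential (backed by a managed identity on a Databricks access connector), an external location can be defined over it. A minimal sketch, assuming a credential named my_mi_credential already exists; the location name, group, container, and path are placeholders:

# A minimal sketch, assuming a storage credential named my_mi_credential
# already exists in the metastore; all other names and paths are placeholders.
spark.sql("""
  CREATE EXTERNAL LOCATION IF NOT EXISTS my_lake_location
  URL 'abfss://mylakecontainer@storageaccount.dfs.core.windows.net/path'
  WITH (STORAGE CREDENTIAL my_mi_credential)
""")

# Grant read access on the location, then read files through it directly
spark.sql("GRANT READ FILES ON EXTERNAL LOCATION my_lake_location TO `data_engineers`")

df = spark.read.csv("abfss://mylakecontainer@storageaccount.dfs.core.windows.net/path/to/csv/file.csv")
df.show()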

Filip
New Contributor II

Yeah, I'm aware that UC fixes this, but I'm not on UC yet, and I wanted to know whether it's even possible to use our own user assigned managed identity instead of the built-in one; it looks like it's not really possible for some reason.

szymon_dybczak
Contributor III

OK, so unfortunately using a User Assigned Managed Identity to read/write from ADLS Gen2 inside a notebook is not directly supported. Your best bet is to use a regular service principal or switch to Unity Catalog.
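
For reference, the regular service principal route typically looks like this. A minimal sketch; the storage account, tenant, client ID, and secret scope names are placeholders, and the secret is read from a Databricks secret scope rather than hardcoded:

storage_account = "mystorage"          # placeholder
tenant_id = "my-entra-tenant-id"       # placeholder
sp_client_id = "my-spn-client-id"      # placeholder
# Assumption: the SPN's secret is stored in a Databricks secret scope
sp_secret = dbutils.secrets.get(scope="my-scope", key="my-spn-secret")

sfx = f"{storage_account}.dfs.core.windows.net"
spark.conf.set(f"fs.azure.account.auth.type.{sfx}", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{sfx}",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{sfx}", sp_client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{sfx}", sp_secret)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{sfx}",
               f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")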

kuniteru
New Contributor II

Hi,

It can be accessed with the following code:

storageAccountName = "my-storage-account-name"
applicationClientId = "my-umi-client-id"   # client ID of the User Assigned Managed Identity
aadDirectoryId = "my-entra-tenant-id"
containerName = "my-lake-container"

# Use MsiTokenProvider (not ManagedIdentityTokenProvider) and point it at
# the UMI's client ID so the right identity is picked up
spark.conf.set("fs.azure.account.auth.type", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type", "org.apache.hadoop.fs.azurebfs.oauth2.MsiTokenProvider")
spark.conf.set("fs.azure.account.oauth2.msi.tenant", aadDirectoryId)
spark.conf.set("fs.azure.account.oauth2.client.id", applicationClientId)

df = spark.read.csv(f"abfss://{containerName}@{storageAccountName}.dfs.core.windows.net/hello.csv")
df.show()
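
If you only want this auth for a single storage account, the same settings can also be scoped per account using the key suffix pattern from the original post. A hedged variant, reusing the placeholder values defined above:

# Scoped variant: same MsiTokenProvider settings, limited to one storage
# account so other accounts keep their own auth configuration.
# (Reuses the placeholder variables from the snippet above.)
sfx = f"{storageAccountName}.dfs.core.windows.net"
spark.conf.set(f"fs.azure.account.auth.type.{sfx}", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{sfx}",
               "org.apache.hadoop.fs.azurebfs.oauth2.MsiTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.msi.tenant.{sfx}", aadDirectoryId)
spark.conf.set(f"fs.azure.account.oauth2.client.id.{sfx}", applicationClientId)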

I too would like to change to UC but can't take the time to do so...
