Databricks Community

Filip · ‎08-22-2024

Hi,

I'm trying to figure out whether we can switch from Entra ID service principals (SPNs) to User Assigned Managed Identities. Everything works, except that I can't figure out how to access the lake files from a Python notebook.

I tried the code below, running it on a cluster as the Managed Identity, but I got the same error as when I ran it from any other cluster:

spark.conf.set("fs.azure.account.auth.type.storageaccount.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.storageaccount.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ManagedIdentityTokenProvider")

df = spark.read.csv("abfss://mylakecontainer@storageaccount.dfs.core.windows.net/path/to/csv/file.csv")
df.show()

but I get an error:

IllegalArgumentException: Failed to initialize org.apache.hadoop.fs.azurebfs.oauth2.ManagedIdentityTokenProvider

I could not find any example online where a Managed Identity is used to access the lake.

Then I decided to try a different approach. I had only the Reader role assigned to that Managed Identity, so it should at least be possible to, for example, print some properties of the storage account:

from azure.identity import DefaultAzureCredential
from azure.identity import ManagedIdentityCredential
from azure.storage.blob import BlobServiceClient
from azure.mgmt.storage import StorageManagementClient

subscription_id = "111-111-111-111"
resource_group_name = "my-rg"
storage_account_name = "mystorage"
account_url = f"https://{storage_account_name}.blob.core.windows.net"

credential = ManagedIdentityCredential()
storage_client = StorageManagementClient(credential, subscription_id)
storage_account = storage_client.storage_accounts.get_properties(
    resource_group_name, storage_account_name
)

print("Storage Account Properties:")
print(f"Name: {storage_account.name}")
print(f"Location: {storage_account.location}")
print(f"Kind: {storage_account.kind}")
print(f"SKU: {storage_account.sku.name}")
print(f"Primary Location: {storage_account.primary_location}")
print(f"Status of Primary: {storage_account.status_of_primary}")

Even with my cluster's Access Mode set to "Assigned", with the Managed Identity as the assignee, I get this error when I run the above code:

HttpResponseError: (AuthorizationFailed) The client '1f945563-4de8-44a0-a979-2c4e4540ad4c' with object id '1f945563-4de8-44a0-a979-2c4e4540ad4c' does not have authorization to perform action 'Microsoft.Storage/storageAccounts/read' over scope '/subscriptions/111-111-11/resourceGroups/my-rg/providers/Microsoft.Storage/storageAccounts/mystorage' or the scope is invalid. If access was recently granted, please refresh your credentials.

The client ID 1f945563-4de8-44a0-a979-2c4e4540ad4c belongs to dbmanagedidentity (the default enterprise application), not to my User Assigned Managed Identity, which is added to the workspace as an SPN. Why is a job running on that cluster in Assigned mode still not using my managed identity?

As a last resort, I tried:

client_id = "my-umi-client-id"
credential = DefaultAzureCredential(managed_identity_client_id=client_id)

But then I get the error:

ClientAuthenticationError: DefaultAzureCredential failed to retrieve a token from the included credentials.

WARNING:azure.identity._credentials.chained:DefaultAzureCredential failed to retrieve a token from the included credentials.

Attempted credentials:

EnvironmentCredential: EnvironmentCredential authentication unavailable. Environment variables are not fully configured.

Visit https://aka.ms/azsdk/python/identity/environmentcredential/troubleshoot to troubleshoot this issue.

ManagedIdentityCredential: ManagedIdentityCredential authentication unavailable. The requested identity has not been assigned to this resource. Error: Unexpected response "{'error': 'invalid_request', 'error_description': 'Identity not found'}"

SharedTokenCacheCredential: SharedTokenCacheCredential authentication unavailable. No accounts were found in the cache.

AzureCliCredential: Azure CLI not found on path

AzurePowerShellCredential: PowerShell is not installed

AzureDeveloperCliCredential: Azure Developer CLI could not be found. Please visit https://aka.ms/azure-dev for installation instructions and then,once installed, authenticate to your Azure account using 'azd auth login'.

To mitigate this issue, please refer to the troubleshooting guidelines here at https://aka.ms/azsdk/python/identity/defaultazurecredential/troubleshoot.

But this Managed Identity is assigned to that cluster...

What am I doing wrong here?

The Azure Databricks Service resource does not have an Identity option, and the workspace's managed resource group contains just a plethora of VMs, so I guess that is not the correct way to assign a User Assigned Managed Identity to the cluster...
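
To check which identity a credential actually resolves to (as with the unexpected client ID in the AuthorizationFailed error above), one option is to decode the claims of the access token it returns. The following is a minimal sketch using only the standard library. The helper names `decode_jwt_claims` and `describe_identity` are hypothetical, the signature is not verified (this is for inspection only), and it assumes the token carries the usual Entra ID claims (`appid` or `azp`, `oid`, `tid`).

```python
import base64
import json


def decode_jwt_claims(token: str) -> dict:
    """Return the claims of a JWT access token WITHOUT verifying its signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected a JWT with three dot-separated parts")
    payload = parts[1]
    # JWTs use unpadded base64url; restore the padding before decoding.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


def describe_identity(claims: dict) -> dict:
    """Pick out the claims that usually identify the caller (assumed claim names)."""
    return {key: claims.get(key) for key in ("appid", "azp", "oid", "tid")}


# Usage on a cluster (requires azure-identity; not executed here):
# from azure.identity import ManagedIdentityCredential
# token = ManagedIdentityCredential().get_token("https://management.azure.com/.default").token
# print(describe_identity(decode_jwt_claims(token)))
```

Comparing the printed `appid` or `oid` with the client ID and object ID of the User Assigned Managed Identity shows whether the cluster is handing out a token for that identity or for a different one.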

szymon_dybczak · ‎08-22-2024

Hi @Filip ,

That is an obsolete way of configuring access to a storage account. Nowadays you should use Unity Catalog (UC), with storage credentials and external locations, to configure access to the storage account.

A storage credential is a securable object representing an Azure managed identity or Microsoft Entra ID service principal. Once a storage credential is created, access to it can be granted to principals (users and groups). Storage credentials are primarily used to create external locations, which scope access to a specific storage path.

Storage credentials - Azure Databricks - Databricks SQL | Microsoft Learn
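
As a rough illustration of the Unity Catalog route described above, the sketch below builds the SQL for an external location on top of an existing storage credential and a grant on it. It assumes a storage credential has already been created in the workspace. The function names, the location name, the credential name, and the group name are all hypothetical placeholders.

```python
def create_external_location_sql(location_name: str, url: str, storage_credential: str) -> str:
    """Build a CREATE EXTERNAL LOCATION statement scoped to one storage path."""
    return (
        f"CREATE EXTERNAL LOCATION IF NOT EXISTS `{location_name}` "
        f"URL '{url}' "
        f"WITH (STORAGE CREDENTIAL `{storage_credential}`)"
    )


def grant_read_files_sql(location_name: str, principal: str) -> str:
    """Build a GRANT that lets a principal read files under the external location."""
    return f"GRANT READ FILES ON EXTERNAL LOCATION `{location_name}` TO `{principal}`"


# Usage in a UC-enabled notebook (all names are placeholders; not executed here):
# spark.sql(create_external_location_sql(
#     "my_lake", "abfss://mylakecontainer@storageaccount.dfs.core.windows.net/path", "my_umi_credential"))
# spark.sql(grant_read_files_sql("my_lake", "data-readers"))
```

With this in place, notebooks read the `abfss://` path directly and no `spark.conf.set` authentication settings are needed.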

Filip · ‎08-22-2024

Yeah, I'm aware that UC fixes this, but I'm not on UC yet. I wanted to know whether it is even possible to assign our own User Assigned Managed Identity instead of using the built-in one, as it looks like it is not really possible for some reason.

szymon_dybczak · ‎08-22-2024

OK, so unfortunately using a User Assigned Managed Identity to read from or write to ADLS Gen2 inside a notebook is not directly supported. Your best bet is to use a regular service principal or switch to Unity Catalog.
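
For reference, the service principal route mentioned above typically uses the ABFS client-credentials provider. The sketch below only builds the settings as a dictionary. The helper name `service_principal_conf` is hypothetical, and the secret scope and key names in the usage comment are placeholders.

```python
def service_principal_conf(storage_account: str, client_id: str,
                           client_secret: str, tenant_id: str) -> dict:
    """Build per-account ABFS OAuth settings for an Entra ID service principal."""
    host = f"{storage_account}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{host}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{host}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{host}": client_id,
        f"fs.azure.account.oauth2.client.secret.{host}": client_secret,
        f"fs.azure.account.oauth2.client.endpoint.{host}":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


# Usage in a notebook (scope and key names are placeholders; not executed here):
# secret = dbutils.secrets.get(scope="my-scope", key="my-spn-secret")
# for key, value in service_principal_conf("storageaccount", "my-spn-client-id", secret, "my-tenant-id").items():
#     spark.conf.set(key, value)
```

Keeping the client secret in a secret scope rather than in the notebook avoids leaking it through notebook history or exports.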

kuniteru · ‎12-04-2024

Hi,

I was able to access the data with the following code.

storageAccountName = "my-storage-account-name"
applicationClientId = "my-umi-client-id"
aadDirectoryId = "my-entra-tenant-id"
containerName = "my-lake-container"

spark.conf.set("fs.azure.account.auth.type", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type", "org.apache.hadoop.fs.azurebfs.oauth2.MsiTokenProvider")
spark.conf.set("fs.azure.account.oauth2.msi.tenant", aadDirectoryId)
spark.conf.set("fs.azure.account.oauth2.client.id", applicationClientId)

df = spark.read.csv("abfss://"+containerName+"@"+storageAccountName+".dfs.core.windows.net/hello.csv")
df.show()

I'd like to move to UC too, but I can't spare the time for it yet...
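
The settings above apply to every storage account in the session. A possible variation, sketched below, scopes the same MSI settings to a single account by appending the account host to each key, in the same way the original question does for `fs.azure.account.auth.type`. This is an assumption that has not been verified on a cluster: it relies on ABFS accepting the account suffix on the `msi.tenant` and `client.id` keys as well, and the helper name `msi_conf` is hypothetical.

```python
def msi_conf(storage_account: str, client_id: str, tenant_id: str) -> dict:
    """Build ABFS managed-identity (MSI) settings scoped to one storage account."""
    host = f"{storage_account}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{host}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{host}":
            "org.apache.hadoop.fs.azurebfs.oauth2.MsiTokenProvider",
        f"fs.azure.account.oauth2.msi.tenant.{host}": tenant_id,
        f"fs.azure.account.oauth2.client.id.{host}": client_id,
    }


# Usage in a notebook (placeholder values; not executed here):
# for key, value in msi_conf("my-storage-account-name", "my-umi-client-id", "my-entra-tenant-id").items():
#     spark.conf.set(key, value)
```

Scoping the keys per account keeps other storage accounts in the same session free to use a different authentication method.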

How to Assign User Managed Identity to DBR Cluster so I can use it for querying ADLSv2?
