Hi,
I'm trying to figure out whether we can switch from Entra ID SPNs to User Assigned Managed Identities. Everything works, except I can't figure out how to access the lake files from a Python notebook.
I tried the code below on a cluster running as the Managed Identity, but I got the same error as when I ran it on any other cluster:
spark.conf.set("fs.azure.account.auth.type.storageaccount.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.storageaccount.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ManagedIdentityTokenProvider")
df = spark.read.csv("abfss://mylakecontainer@storageaccount.dfs.core.windows.net/path/to/csv/file.csv")
df.show()
but I get an error:
IllegalArgumentException: Failed to initialize org.apache.hadoop.fs.azurebfs.oauth2.ManagedIdentityTokenProvider
I could not find any example online where a Managed Identity is used to get access to the lake.
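One variant that may be worth trying: the Hadoop ABFS documentation lists optional tenant and client ID settings for this token provider. Below is a minimal sketch of building those settings. The helper name `msi_abfs_conf` and the ID values are placeholders, and I have not confirmed that Databricks clusters honor these keys or expose an identity endpoint the provider can reach.

```python
def msi_abfs_conf(account: str, client_id: str, tenant_id: str) -> dict:
    """Build per-account ABFS OAuth settings for a managed identity.

    Key names follow the Hadoop ABFS documentation; whether they take
    effect on a Databricks cluster is an untested assumption.
    """
    host = f"{account}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{host}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{host}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ManagedIdentityTokenProvider",
        f"fs.azure.account.oauth2.msi.tenant.{host}": tenant_id,
        f"fs.azure.account.oauth2.client.id.{host}": client_id,
    }


# Placeholder values, not real identifiers.
conf = msi_abfs_conf("storageaccount", "my-umi-client-id", "my-tenant-id")

# In a notebook, where `spark` is predefined:
# for key, value in conf.items():
#     spark.conf.set(key, value)
```

If the provider still fails to initialize with an explicit client ID, that would suggest the identity is simply not reachable from the cluster VMs.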
Then I decided to try a different approach. Since that Managed Identity only has the Reader role assigned, it should at least be possible to, for example, print some properties:
from azure.identity import DefaultAzureCredential
from azure.identity import ManagedIdentityCredential
from azure.storage.blob import BlobServiceClient
from azure.mgmt.storage import StorageManagementClient
subscription_id = "111-111-111-111"
resource_group_name = "my-rg"
storage_account_name = "mystorage"
account_url = f"https://{storage_account_name}.blob.core.windows.net"
credential = ManagedIdentityCredential()
storage_client = StorageManagementClient(credential, subscription_id)
storage_account = storage_client.storage_accounts.get_properties(
    resource_group_name, storage_account_name
)
print("Storage Account Properties:")
print(f"Name: {storage_account.name}")
print(f"Location: {storage_account.location}")
print(f"Kind: {storage_account.kind}")
print(f"SKU: {storage_account.sku.name}")
print(f"Primary Location: {storage_account.primary_location}")
print(f"Status of Primary: {storage_account.status_of_primary}")
Even though my cluster has Access Mode set to "Assigned", with the Managed Identity as the assignee, running the above code gives this error:
HttpResponseError: (AuthorizationFailed) The client '1f945563-4de8-44a0-a979-2c4e4540ad4c' with object id '1f945563-4de8-44a0-a979-2c4e4540ad4c' does not have authorization to perform action 'Microsoft.Storage/storageAccounts/read' over scope '/subscriptions/111-111-11/resourceGroups/my-rg/providers/Microsoft.Storage/storageAccounts/mystorage' or the scope is invalid. If access was recently granted, please refresh your credentials.
The client ID 1f945563-4de8-44a0-a979-2c4e4540ad4c belongs to dbmanagedidentity (the default enterprise application), not to my User Assigned Managed Identity, which is added to the workspace as an SPN. Why is a job running on that cluster in assigned mode still not using my managed identity?
As a last resort, I tried:
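To double-check which identity a credential actually resolves to, the claims of the access token can be decoded without verifying the signature. This is only a diagnostic sketch: `decode_jwt_claims` is a hypothetical helper name, and the claim names (`oid`, `appid`) are what Entra ID tokens commonly carry, so treat them as an assumption.

```python
import base64
import json


def decode_jwt_claims(token: str) -> dict:
    """Return the claims of a JWT without verifying its signature.

    For debugging only; never trust unverified claims for authorization.
    """
    payload = token.split(".")[1]
    # JWT segments are base64url-encoded with the padding stripped.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


# In the notebook (assumes azure-identity is installed on the cluster):
# from azure.identity import ManagedIdentityCredential
# token = ManagedIdentityCredential().get_token(
#     "https://management.azure.com/.default"
# ).token
# claims = decode_jwt_claims(token)
# print(claims.get("oid"), claims.get("appid"))
```

Comparing the printed object ID with the one in the error message and with the User Assigned Managed Identity's object ID shows which identity the cluster is handing out.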
client_id = "my-umi-client-id"
credential = DefaultAzureCredential(managed_identity_client_id=client_id)
But then I get the error:
ClientAuthenticationError: DefaultAzureCredential failed to retrieve a token from the included credentials.
WARNING:azure.identity._credentials.chained:DefaultAzureCredential failed to retrieve a token from the included credentials.
Attempted credentials:
EnvironmentCredential: EnvironmentCredential authentication unavailable. Environment variables are not fully configured.
ManagedIdentityCredential: ManagedIdentityCredential authentication unavailable. The requested identity has not been assigned to this resource. Error: Unexpected response "{'error': 'invalid_request', 'error_description': 'Identity not found'}"
SharedTokenCacheCredential: SharedTokenCacheCredential authentication unavailable. No accounts were found in the cache.
AzureCliCredential: Azure CLI not found on path
AzurePowerShellCredential: PowerShell is not installed
AzureDeveloperCliCredential: Azure Developer CLI could not be found. Please visit
https://aka.ms/azure-dev for installation instructions and then,once installed, authenticate to your Azure account using 'azd auth login'.
But this Managed Identity is assigned to that cluster...
What am I doing wrong here?
The Azure Databricks Service resource does not have an Identity option, and if I go to the managed resource group of the workspace, there is just a plethora of VMs, so I guess that is not the correct way to assign a User Assigned Managed Identity to the cluster...