
UnauthorizedAccessException: PERMISSION_DENIED: User does not have READ FILES on External Location

kDev
New Contributor

Our jobs have been running fine so far without any issues on a specific workspace. These jobs read data from files on Azure ADLS storage containers and don't use the Hive metastore data at all.

Now we attached the Unity Catalog metastore to this workspace, created the necessary storage credentials, configured the external locations, and granted permissions to the specific user group/service principal. The jobs started to fail with the "UnauthorizedAccessException: PERMISSION_DENIED: User does not have READ FILES on External Location" error message.

We removed/deleted the external location configuration in Unity Catalog, and this time the jobs executed successfully.

What could be missing here?

8 REPLIES

Anonymous
Not applicable

@kumar mahadevan:

Based on the error message you received, it seems like the user or service principal that is running the Databricks job does not have the necessary read permissions on the Azure ADLS storage containers.

First, double-check that the user or service principal has been granted the appropriate permissions to access the storage containers. This can be done through the Azure portal or using Azure CLI.

Second, ensure that the credentials for accessing the storage containers are properly configured in the Databricks workspace. You may want to check the Databricks Secret Scopes to make sure the correct secrets are being used and that they have not expired.
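For example, a minimal sketch of that (non-Unity-Catalog) direct-access setup in a notebook, pulling the service principal credentials from a secret scope into the Spark config - the storage account, tenant ID, scope and key names below are all placeholders:

# Sketch of the legacy direct-access pattern (no Unity Catalog involved):
# read the service principal credentials from a Databricks secret scope
# and wire them into the Spark config. Angle-bracket values are placeholders.
storage_account = "<storage-account-name>"
suffix = f"{storage_account}.dfs.core.windows.net"

spark.conf.set(f"fs.azure.account.auth.type.{suffix}", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{suffix}",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{suffix}",
               dbutils.secrets.get(scope="adls-scope", key="sp-client-id"))
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{suffix}",
               dbutils.secrets.get(scope="adls-scope", key="sp-client-secret"))
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{suffix}",
               "https://login.microsoftonline.com/<tenant-id>/oauth2/token")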

Finally, ensure that the correct path and permissions are being used in the job code when accessing the storage containers. For example, you may need to specify the correct directory or file permissions to access the data.

Masha
New Contributor III

Hello @kDev, were you able to solve this issue? I now have the same issue and it seems like I have already tried everything...

Wojciech_BUK
Valued Contributor III

OK, so here is the thing:

1. If you were not using Unity Catalog before, that means you used a totally different approach, e.g. mounts.
If so, you were accessing storage via a dbfs path.

Once you switch to Unity Catalog you need to take care of a few things:
- the one you did - create a Databricks access connector, assign the Storage Blob Data Contributor role to the access connector (IMPORTANT: allow workspace clusters to access the storage), create the storage credential, create the external location

- as you mentioned, you need to grant the READ FILES permission on the External Location to the service principal that will be executing the READ operation against your Storage Account (see the GRANT sketch after the path below)

- you also need to make sure the clusters used to execute the job support Unity Catalog, and that you don't use a dbfs path but instead use the same path (pattern) as you placed in the External Location:

abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path_to_file>
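
For example, the grant itself can be done from a notebook or the SQL editor - a sketch, where the external location name and the principal's application ID are placeholders and the statement has to be run by a metastore admin or the external location owner:

# Grant READ FILES on the external location to the principal that will run the job.
# "my_ext_location" and the application ID below are placeholders.
spark.sql("""
    GRANT READ FILES
    ON EXTERNAL LOCATION my_ext_location
    TO `<service-principal-application-id>`
""")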


Remember: if you run a Workflow Job, the "Run As" principal is the one that will try to access the External Location; if not specified, that will be the Job owner.

 

Masha
New Contributor III

@Wojciech_BUK Thanks a lot for the feedback! I have a couple of questions: 

When you say "allow workspace clusters to access storage" - I understand that for an interactive cluster. In my case, I was trying to trigger a Databricks Notebook/Job from Azure Data Factory.

So I have the ADF, which has a system-assigned managed identity, and I also created a user-assigned MI while troubleshooting.

To both of those identities I assigned:

1. Storage Blob Data Contributor on the Storage Account

2. ALL PRIVILEGES on the External location

3. ALL PRIVILEGES on the Storage Credential

4. Allow cluster creation/Databricks SQL access/Workspace access in Workspace settings/Identity and access/Service principals

5. I added both of these service principals to the "admin" group in Workspace settings/Identity and access/Groups , which has all possible Entitlements: Workspace access/Databricks SQL access/Allow unrestricted cluster creation/allow-instance-pool-create

6. ALL PRIVILEGES on the catalog in the Catalog Explorer

I trigger the Notebook via a Databricks Linked Service with "New Job Cluster", and I have configured

13.3.x-cpu-ml-scala2.12 as the Databricks Runtime version
and
Standard_DS3_v2 as the node type

(which is exactly the same configuration as the Interactive Cluster that I have, and that runs the Unity Catalog related notebook code just fine).

And yes, I am using the abfss. 

 

Any idea what else I could have missed?


Wojciech_BUK
Valued Contributor III

@Masha in your case:

  • No - I am talking about all types of clusters: Job, Interactive, SQL Warehouse and Serverless - regardless of which one you use, you need to ensure network connectivity is open for those clusters.
    NOTE: if you are on a DEV env, you can disable the firewall on the storage account (allow public network access to your storage).
  • The ADF managed identity only needs the "Can Manage Run" permission on the Workflow Job you have created.

You should not give the Storage Blob Data Contributor role to any identity other than the Databricks access connector.
You should not give ANY privilege to ANY identity over the Storage Credential.

If your notebook is only designed to read files from the Storage Account, you only need to grant "READ FILES" on an External Location that overlaps with your abfss path, and the identity that needs that privilege is either the JOB OWNER or the JOB "RUN AS" identity.

 

It matters which TASK in Data Factory you use: if you are running an existing JOB, you are doing it via a REST API call, and then what I mentioned above is true.

If you are using the Databricks Notebook activity, you need to grant the ADF identity a privilege on your Workspace notebook (I don't remember the minimum privilege level) and also the privilege on the External Location, because the entire notebook will be executed in the context of the ADF identity.

My suggestion would be to:

  1. Create a notebook in a SHARED folder in the Workspace that only has a sleep command in Python, and try to execute it successfully from ADF.
  2. Once the above works, add spark.read.csv("your_abfss_ext_location_path_to_file") and check if you can execute that from ADF (see the sketch below).
    Alternatively, you can create a Job in Workflows on an interactive cluster and set RUN AS to the ADF service principal - it will speed up the debugging process.
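
A rough sketch of that two-step notebook (the abfss path below is a placeholder - use one covered by your External Location):

# Step 1: no storage access at all - this only proves ADF can run the notebook.
import time
time.sleep(10)

# Step 2: add this only once step 1 succeeds from ADF - read a file through
# the External Location path (placeholder path below).
df = spark.read.csv(
    "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path_to_file>",
    header=True,
)
display(df)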

Please also execute the following test (each step has to succeed):

  • Go to the External Location you have created
  • Click BROWSE
  • Go to the file location that you want to read (sometimes an error will be thrown there, which means something is wrong with the External Location setup)
  • Copy the file path from the UI
  • Read this file from an interactive notebook with your own privileges (read the file and display it - make sure a Spark action is executed)
  • Create a Job (Workflow) that will execute the above notebook and, in the RUN AS section, select the ADF identity
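
A quick way to also double-check the grant itself from a notebook (a sketch; "my_ext_location" is a placeholder for your external location name):

# List who holds which privileges on the external location - the ADF identity
# (or the job's RUN AS principal) should show up with READ FILES.
display(spark.sql("SHOW GRANTS ON EXTERNAL LOCATION my_ext_location"))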

I encourage you to run Workflow Jobs from ADF rather than executing the notebook directly, so the last thing to do will be to give the ADF identity the privilege to run this Job.

If you still prefer to run the notebook from ADF, you will need to make sure the ADF identity can create Job runs and has access to this notebook; if you placed it in a SHARED folder, it will have that by default.

Additional consideration:
I have seen multiple times that people made a mistake when creating a PRINCIPAL in the Databricks Workspace, copying the Object ID instead of the Principal ID 🙂

Masha
New Contributor III

@Wojciech_BUK thanks a lot for all your advice! I solved the issue after all, and the solution was ridiculously easy as well as ridiculously non-obvious (to me at least).

So the thing was that I was adding my Data Factory managed identity to the External Location permissions using its display name. All I had to do instead was add it using its "Managed Identity Application ID".

[screenshot attached]

I am not sure if that is the default behavior or if I configured something wrong somewhere else to end up here. But anyway, now it works!

Wojciech_BUK
Valued Contributor III

Cool, I am happy it worked.
It would be extremely hard to figure this out without looking into your environment, but to be honest I have struggled with those identities, principals and so on myself, so it is good that the solution will be on this forum.

I don't know how you managed to grant the permission by name instead of ID - I was not able to do that via a GRANT statement 🙂

I find it difficult to manage grants with those IDs, so I always add principals to a group, e.g. "SP-Process-Orchestrator", and then grant the privilege to the group 🙂
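
For example, granting to the group instead of to each individual ID (a sketch; the external location and group names are placeholders):

# Grant on the external location to a group by its name, so new service
# principals only need to be added to the group, not granted individually.
# "my_ext_location" and "SP-Process-Orchestrator" are placeholder names.
spark.sql("GRANT READ FILES ON EXTERNAL LOCATION my_ext_location TO `SP-Process-Orchestrator`")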

Masha
New Contributor III

@Wojciech_BUK I granted both in the GUI :) You can either search for the display name there (and then it uses the Managed Identity Object ID), or you can search directly for the value of the "Managed Identity Application ID", and then it works correctly!

[screenshot attached]

 
