DBFS mounts no longer recommended?

kmehkeri
New Contributor III

Checking the docs, I noticed this statement on the Azure storage access page:

[https://learn.microsoft.com/en-us/azure/databricks/external-data/azure-storage]

Deprecated patterns for storing and accessing data from Azure Databricks

The following are deprecated storage patterns:

  • Mounting external data locations to the Databricks Filesystem (DBFS mounts)

Well, so far I've used ADLS Gen2 mounts (e.g. at dbfs:/mnt/datalake) as locations for my databases/schemas:

CREATE SCHEMA foo LOCATION '/mnt/datalake/foo';

Sounds like this is no longer recommended, is it? The remaining methods on the page describe ad-hoc connections, except for Unity Catalog external locations, but even those are mentioned primarily as a way to create external tables:

Unity Catalog manages access to data in Azure Data Lake Storage Gen2 using external locations. Administrators primarily use external locations to configure Unity Catalog external tables, but can also delegate access to users or groups using the available privileges (READ FILES, WRITE FILES, and CREATE TABLE).

What about managed tables then? Any guidelines?

Last but not least, why exactly are mounts not recommended in the first place?

1 ACCEPTED SOLUTION

karthik_p
Esteemed Contributor

The DBFS mount recommendation above applies to Unity Catalog-enabled workspaces: once you enable Unity Catalog, you access your external data through external locations and storage credentials.

Without UC, as far as I know, DBFS mounts are the only way to access your external data.
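For a UC-enabled workspace, the pattern looks roughly like the sketch below (the location, credential, and group names are hypothetical, and the storage credential is assumed to already exist, e.g. one backed by an Azure Databricks access connector):

# Minimal Unity Catalog sketch -- hypothetical names throughout.
# Assumes a storage credential named `datalake_cred` has already been created.

# Register the ADLS Gen2 path as a governed external location.
spark.sql("""
    CREATE EXTERNAL LOCATION IF NOT EXISTS datalake_loc
    URL 'abfss://data@mystorageaccount.dfs.core.windows.net/'
    WITH (STORAGE CREDENTIAL datalake_cred)
""")

# Delegate access using the privileges named in the docs quote above.
spark.sql("""
    GRANT READ FILES, WRITE FILES, CREATE TABLE
    ON EXTERNAL LOCATION datalake_loc TO `data_engineers`
""")

For managed tables, storage is governed at the catalog or schema level in UC (for example via a MANAGED LOCATION clause on CREATE CATALOG/SCHEMA) rather than via a /mnt path.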


6 REPLIES


Jfoxyyc
Valued Contributor

You don't need to mount to DBFS. The security issue with mounting is that as soon as you've mounted using a credential, everyone with access to the workspace now has access to the data at that mounted location.

The recommended alternative to mounting to DBFS is to use session-scoped connections with provider-backed secret scopes (Azure Key Vault, AWS Parameter Store, etc.) and access control lists. This way, you have a service principal/IAM role that has access to the storage location, and you control who has access to the secrets for that service principal. I personally put all of my Databricks artifacts in repos under the following folder setup:

databricks>notebooks>category>artifact

databricks>functions>category.py

If you were using Azure, you could have a function called set_session_scope in databricks.functions.azure; then you could just import it with from databricks.functions.azure import set_session_scope and pass it the required parameters.
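A minimal sketch of what such a session-scoped helper could look like (the function name, secret scope, and key are hypothetical; only the Spark configuration keys are the documented ABFS OAuth settings, and this assumes a service principal whose client secret is stored in a Key Vault-backed secret scope):

# Hypothetical helper illustrating session-scoped ADLS Gen2 access with OAuth.
# dbutils and spark are available implicitly in Databricks notebooks.

def set_session_scope(spark, storage_account, client_id, tenant_id,
                      secret_scope, secret_key):
    """Configure this Spark session to read/write the given storage account
    using a service principal, without mounting anything to DBFS."""
    client_secret = dbutils.secrets.get(scope=secret_scope, key=secret_key)
    suffix = f"{storage_account}.dfs.core.windows.net"
    spark.conf.set(f"fs.azure.account.auth.type.{suffix}", "OAuth")
    spark.conf.set(f"fs.azure.account.oauth.provider.type.{suffix}",
                   "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
    spark.conf.set(f"fs.azure.account.oauth2.client.id.{suffix}", client_id)
    spark.conf.set(f"fs.azure.account.oauth2.client.secret.{suffix}", client_secret)
    spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{suffix}",
                   f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")

# Usage (abfss:// paths instead of /mnt paths):
# set_session_scope(spark, "mystorageaccount", "<app-id>", "<tenant-id>",
#                   "my-keyvault-scope", "sp-client-secret")
# df = spark.read.parquet("abfss://data@mystorageaccount.dfs.core.windows.net/foo/")

Only users who can read the secret scope can obtain the credential, so access is governed by the scope's access control list rather than by who can reach a mount point.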

kmehkeri
New Contributor III

> as soon as you've mounted using a credential, everyone with access to the workspace now has access to the data at that mounted location

Is this true even if you're only using clusters with table access control enabled (which should prevent users from directly accessing mounted storage via dbutils)?

Honestly, I don't see how mounting storage upfront differs from using ad-hoc connection details from a security perspective - both approaches require service principal credentials or a SAS, unless you're using Unity Catalog external locations as karthik_p hinted.

Jfoxyyc
Valued Contributor

If you have the appropriate policies in place, where people are unable to create clusters and only clusters with table access control enabled are available, then yes, you've essentially made mount points available only to those who have access to clusters that can use them, or to administrators. That would be similar to disabling mounts, accessing external data only through session-scoped credentials, and managing access to those credentials through access control lists.

The difference between the two methods boils down to the objects available for managing access: access control lists, versus the administrator role plus table-access-control clusters.

Mounting by nature exposes the data to everyone with access to any cluster that can see the mount point. The way around that, as you've stated, is to enable only table access control and leave only administrators able to access file-level data. But then you need to promote a principal to administrator to be able to access file data, which may be unwanted.

Not mounting, and only accessing file data through session-scoped credentials, allows you to use any cluster type. You can then create a secret scope with access control lists and specify which users/groups have access to the scope, controlling who has access to the file data in a least-privileged manner.
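As an illustration of that last step (scope and group names are hypothetical), secret scope ACLs can be managed programmatically, for example with the Databricks SDK for Python; the CLI and REST API offer the same operations:

# Hypothetical scope/group names; a sketch using the Databricks SDK for Python
# (pip install databricks-sdk). Assumes the secret scope already exists.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.workspace import AclPermission

w = WorkspaceClient()  # picks up authentication from the environment / notebook context

# Give one group read-only access to the scope holding the service principal's secret;
# principals outside the ACL get nothing, which is the least-privileged setup described above.
w.secrets.put_acl(scope="adls-sp-secrets",
                  principal="data_engineers",
                  permission=AclPermission.READ)

# List the ACLs to confirm who can read the credentials.
for acl in w.secrets.list_acls(scope="adls-sp-secrets"):
    print(acl.principal, acl.permission)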

Cloud provider notes:

  • All users in the Azure Databricks workspace have access to the mounted ADLS Gen2 account. The service principal you use to access the ADLS Gen2 account should be granted access only to that ADLS Gen2 account; it should not be granted access to other Azure resources.
  • When you create a mount point through a cluster, cluster users can immediately access the mount point. To use the mount point in another running cluster, you must run dbutils.fs.refreshMounts() on that running cluster to make the newly created mount point available for use.
  • Unmounting a mount point while jobs are running can lead to errors. Ensure that production jobs do not unmount storage as part of processing.
  • Mount points that use secrets are not automatically refreshed. If mounted storage relies on a secret that is rotated, expires, or is deleted, errors such as 401 Unauthorized can occur. To resolve such an error, you must unmount and remount the storage (see the sketch after this list).
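Here is a sketch of that mount lifecycle (storage account, container, scope, and key names are hypothetical; the OAuth settings are the documented ABFS service-principal configuration), for the case where a rotated secret forces a remount:

# Hypothetical names throughout; dbutils is available in Databricks notebooks.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    # The secret is read once at mount time; a rotated value is not picked up automatically.
    "fs.azure.account.oauth2.client.secret":
        dbutils.secrets.get(scope="my-keyvault-scope", key="sp-client-secret"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

mount_point = "/mnt/datalake"

# After rotating the secret, unmount and remount to pick up the new value.
if any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
    dbutils.fs.unmount(mount_point)

dbutils.fs.mount(
    source="abfss://data@mystorageaccount.dfs.core.windows.net/",
    mount_point=mount_point,
    extra_configs=configs,
)

# Clusters that were already running need to refresh their view of the mount table.
dbutils.fs.refreshMounts()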

kmehkeri
New Contributor III

Thank you for the detailed reply. I misunderstood at first what exactly "session-scoped credentials" means; it's a pity it's not really mentioned anywhere in the official docs (or I haven't found it). But I came across a nice summary of pre-Unity storage access patterns here: https://github.com/hurtn/datalake-ADLS-access-patterns-with-Databricks, which explains it.

I only have one more question about this note:

The service principal you use to access the ADLS Gen2 account should be granted access only to that ADLS Gen2 account; it should not be granted access to other Azure resources.

Does this suggest that having a mount leaks the secret to users, so they can just grab it and exploit the service principal's permissions on other Azure resources? Or is this just an extra precaution?

Jfoxyyc
Valued Contributor

It mostly just follows the principle of least privilege. Having a mount doesn't necessarily leak the secret, but typically having a mount means you've stored the secret somewhere users can likely access it. Anyway, "least privilege" means that when you need a permission or role to do a thing, that permission or role should have access to just the thing you need to do and nothing else. This way, even if the credential does get leaked, the impact is lower than if it were, say, a global admin.
