12-19-2022 06:05 AM
Checking the docs, I noticed this statement on the Azure storage access page:
[https://learn.microsoft.com/en-us/azure/databricks/external-data/azure-storage]
Deprecated patterns for storing and accessing data from Azure Databricks
The following are deprecated storage patterns:
Well, so far I have used ADLS Gen2 mounts (e.g. at dbfs:/mnt/datalake) as locations for my databases/schemas:
CREATE SCHEMA foo LOCATION '/mnt/datalake/foo';
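For reference, the mounts were created with the standard OAuth pattern, roughly like this (the secret scope, key, and account names below are placeholders):

```python
# Roughly how the mount was created - placeholder names throughout.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope>", key="<sp-secret>"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

dbutils.fs.mount(
    source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/datalake",
    extra_configs=configs,
)
```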
Sounds like this is no longer recommended, is it? The remaining methods on the page describe ad-hoc connections, except for Unity Catalog external locations, but even those are mentioned primarily as a way to create external tables:
Unity Catalog manages access to data in Azure Data Lake Storage Gen2 using external locations. Administrators primarily use external locations to configure Unity Catalog external tables, but can also delegate access to users or groups using the available privileges (READ FILES, WRITE FILES, and CREATE TABLE).
What about managed tables then? Any guidelines?
Last but not least, why exactly are mounts not recommended in the first place?
12-19-2022 07:57 AM
@X X The recommendation against DBFS mounts applies to Unity Catalog enabled workspaces: once you enable Unity Catalog, you access your external data through external locations and storage credentials.
Without UC, as far as I know, DBFS mounts are the main way to access your external data.
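Roughly, that looks like this (a sketch, assuming a storage credential named adls_cred has already been registered; the location, container, and group names are placeholders):

```python
# Sketch: define an external location on top of an existing storage credential,
# then grant read access to a group - all identifiers are placeholders.
spark.sql("""
    CREATE EXTERNAL LOCATION IF NOT EXISTS datalake_foo
    URL 'abfss://<container>@<storage-account>.dfs.core.windows.net/foo'
    WITH (STORAGE CREDENTIAL adls_cred)
""")

spark.sql("GRANT READ FILES ON EXTERNAL LOCATION datalake_foo TO `data_readers`")
```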
12-28-2022 11:16 PM
You don't need to mount to dbfs. The security issue with mounting is that as soon as you've mounted using a credential, everyone with access to the workspace now has access to the data at that mounted location.
The recommended alternative to mounting to dbfs is to use session-scoped connections with provider secret scopes (Azure Key Vault, AWS Parameter Store, etc.) and access control lists. This way, you have a service principal/IAM role that has access to the storage location, and you control who has access to the secrets for that service principal. I personally put all of my Databricks artifacts in repos under the following folder setup:
databricks>notebooks>category>artifact
databricks>functions>category.py
If you were using Azure, you could have a function called set_session_scope in databricks/functions/azure.py and simply import it where needed: from databricks.functions.azure import set_session_scope.
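For what it's worth, here is a minimal sketch of what such a set_session_scope function could look like on Azure (the Spark config keys are the documented ADLS Gen2 OAuth settings; the parameter and secret key names are my own placeholders):

```python
def set_session_scope(spark, dbutils, storage_account, secret_scope, tenant_id):
    """Configure session-scoped OAuth access to one ADLS Gen2 account.

    Nothing is mounted workspace-wide: only principals with READ on the
    secret scope can establish this session.
    """
    sa = f"{storage_account}.dfs.core.windows.net"
    spark.conf.set(f"fs.azure.account.auth.type.{sa}", "OAuth")
    spark.conf.set(
        f"fs.azure.account.oauth.provider.type.{sa}",
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    )
    spark.conf.set(
        f"fs.azure.account.oauth2.client.id.{sa}",
        dbutils.secrets.get(scope=secret_scope, key="sp-client-id"),
    )
    spark.conf.set(
        f"fs.azure.account.oauth2.client.secret.{sa}",
        dbutils.secrets.get(scope=secret_scope, key="sp-client-secret"),
    )
    spark.conf.set(
        f"fs.azure.account.oauth2.client.endpoint.{sa}",
        f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    )
```

After calling it, you read directly via abfss:// paths, e.g. spark.read.load("abfss://<container>@<storage-account>.dfs.core.windows.net/path"), without any mount.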
12-29-2022 07:28 AM
> as soon as you've mounted using a credential, everyone with access to the workspace now has access to the data at that mounted location
Is this true even if only using clusters with table access control enabled (which should prevent users from directly accessing mounted storage via dbutils)?
Honestly, I don't see how mounting storage upfront differs from using ad-hoc connection details from a security perspective: both approaches require service principal credentials or a SAS, unless you use Unity Catalog external locations as karthik_p hinted.
12-29-2022 01:12 PM
If you have the appropriate policies in place, where people are unable to create clusters and only clusters with table access control enabled are available, then yes: you've essentially made mount points available only to those with access to clusters that can use them, or to Administrators. That would be similar to disabling mounts and only accessing external data through session-scoped credentials, managing access to those credentials through access control lists.
The difference between the two methods boils down to the objects available for managing access: access control lists, versus the Administrator role plus table access control clusters.
Mounting, by nature, exposes data to everyone with access to any cluster that can see the mount point. The workaround, as you've stated, is to allow only table access control clusters and leave file-level data accessible to Administrators alone. But then you need to promote a principal to Administrator for it to access file data, which may be unwanted.
Not mounting, and only accessing file data through session-scoped credentials, lets you use any cluster type. You can then create a secret scope with access control lists and specify which users/groups have access to the scope, controlling who can reach the file data in a least-privileged manner.
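To make that concrete, here's a minimal sketch of putting an ACL on a secret scope through the Secrets REST API (the workspace URL, token, scope, and group names are all placeholders; the Databricks CLI can do the same):

```python
import requests

# Sketch: grant a group READ on a secret scope - placeholder values throughout.
host = "https://<workspace>.azuredatabricks.net"
token = "<personal-access-token>"

resp = requests.post(
    f"{host}/api/2.0/secrets/acls/put",
    headers={"Authorization": f"Bearer {token}"},
    json={"scope": "adls-sp-scope", "principal": "data_engineers", "permission": "READ"},
)
resp.raise_for_status()
```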
12-29-2022 02:56 PM
Thank you for the detailed reply. I misunderstood at first what exactly "session scoped credentials" means; a pity it's not really spelled out anywhere in the official docs (or I haven't found it). But I came across a nice summary of pre-Unity storage access patterns here: https://github.com/hurtn/datalake-ADLS-access-patterns-with-Databricks, which explains it.
I only have one more question about this note:
The service principal you use to access the ADLS Gen2 account should be granted access only to that ADLS Gen2 account; it should not be granted access to other Azure resources.
Does this suggest that having a mount leaks the secret to users, and that they could just grab it and exploit the service principal's permissions on other Azure resources? Or is this just an extra precaution?
12-30-2022 11:59 AM
It mostly just follows the principle of least privilege. Having a mount doesn't necessarily leak the secret, but typically it means you've stored the secret somewhere users can likely access. "Least privileged" means that when you need a permission or role to do a thing, that permission or role should grant access to just that thing and nothing else. That way, even if the credential does get leaked, the impact is lower than if it were, say, a global admin.