Overall security/access rights concept needed (combine Table Access Control and Credential Passthrough), how to allow users the benefits of both worlds

Gerhard
New Contributor III

What we have:

Databricks Workspace Premium on Azure

ADLS Gen2 storage for raw data, processed data (tables) and files like CSV, models, etc.

What we want to do:

We have users who want to work in Databricks to develop and run Python algorithms. We have central tables where raw data is ingested (by jobs) and which the users query to feed their algorithms.

These central tables are maintained by the administrators and should only be readable by the users. During development the users also need to work with other files (e.g. CSV, model files, etc.) and therefore need access to files stored on ADLS.

The users really like the ability to switch between SQL, Python and so on: extract the needed data from the central tables with SQL into a DataFrame, convert it to pandas, do their work and potentially write the result back to a table.

The main problem is that I cannot combine Table Access Control and Credential Passthrough (a documented limitation). On the one hand, users should only be able to see and query the tables they have access to (no UPDATE, DELETE, DROP, etc.); on the other hand, they should be able to work freely with files that live in another area (container) of the ADLS account.

I have now spent quite some time getting my head around the options for limiting access within the workspace in order to separate different groups of users and their respective access. However, I have not found the ideal solution yet.

This is what I have tried.

Table Access Control

Limits access to certain tables/views/etc., but it works only on High Concurrency clusters, and file access is then only possible through a mount backed by a service principal, so all other users on that cluster would also have access to the data. -> not an option
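
(For completeness, if the HC-only limitation were acceptable, the read-only setup would look roughly like the sketch below; the database, table and group names are hypothetical.)

```python
# Minimal table-ACL sketch, assuming a High Concurrency cluster with table access
# control enabled; central_raw, events and data-scientists are hypothetical names.
spark.sql("GRANT USAGE ON DATABASE central_raw TO `data-scientists`")
spark.sql("GRANT SELECT, READ_METADATA ON TABLE central_raw.events TO `data-scientists`")
# Without MODIFY or ownership the group cannot UPDATE, DELETE or DROP the table.
```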

Credential passthrough

Works and limits the access of an individual user, both on a mount and directly (abfss://...). I can create a database with MANAGED tables that use the mount as its LOCATION. -> great, but

Jobs cannot impersonate a user, so they cannot access the mount and therefore cannot access the table. I have tried adding a service principal as a user (scim-sp) and changing the job owner to that service principal, but that did not work. The only option I see is to create the tables not on a mount but directly with a LOCATION of abfss://..., although ideally I would like to work with MANAGED tables that use a mount as their source (a sketch of both variants follows). -> anyway
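
To make the two variants concrete, here is a minimal sketch; the database, table names and the storage account are hypothetical placeholders:

```python
# Variant A: database located on a credential-passthrough mount; tables created
# without an explicit LOCATION become MANAGED tables under the database location.
spark.sql("CREATE DATABASE IF NOT EXISTS processed LOCATION '/mnt/processed/db'")
spark.sql("CREATE TABLE IF NOT EXISTS processed.sales (id INT, amount DOUBLE) USING DELTA")

# Variant B: table registered directly against ADLS (hypothetical storage account),
# which is what jobs without impersonation can still reach via a service principal.
spark.sql("""
    CREATE TABLE IF NOT EXISTS processed.sales_direct (id INT, amount DOUBLE)
    USING DELTA
    LOCATION 'abfss://processed@mystorageacct.dfs.core.windows.net/sales_direct'
""")
```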

The problem here is that users see databases and tables they potentially do not have access to (in ADLS). Worse, they can drop tables and databases (though not the underlying data). Even worse, they could create a table with the same name but in another location and thereby break other processes/jobs that run on data from those tables.

JDBC/ODBC Connection

FYI: Tables that are MANAGED and located on a mount with credential passthrough cannot be accessed via JDBC. They have to be located via abfss:// and the service principal key configuration (see best practices) has to be in the cluster Spark config.
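
The service principal configuration I mean is roughly the following (storage account, application ID, secret scope and tenant ID are hypothetical placeholders); the same key/value pairs can be entered in the cluster Spark config so that JDBC/ODBC sessions can resolve the abfss:// locations:

```python
# Sketch of the OAuth/service-principal configuration for direct abfss:// access
# (all names below are hypothetical placeholders).
storage_account = "mystorageacct"
spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net",
               "<application-id>")
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net",
               dbutils.secrets.get(scope="my-scope", key="sp-secret"))
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
               "https://login.microsoftonline.com/<tenant-id>/oauth2/token")
```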

So this is my situation. Did I miss an option here? Any ideas are welcome.

Links:

Best Practices: https://docs.microsoft.com/en-us/azure/databricks/security/data-governance

Table access control: https://docs.microsoft.com/en-us/azure/databricks/security/access-control/table-acls/

Credential passthrough: https://docs.microsoft.com/en-us/azure/databricks/security/credential-passthrough/adls-passthrough

9 REPLIES

Hubert-Dudek
Esteemed Contributor III

It is a bit of a pain currently. I hope that Unity Catalog will solve your issues, but it is still not available (you can sign up for the waitlist): https://databricks.com/product/unity-catalog

Gerhard
New Contributor III

Hi Hubert, thanks for the link. I was not aware of this upcoming product. In the meantime I hope to find an intermediate solution.

RicksDB
Contributor II

We had the same issue and never could find any combination that satisfied our needs.

We had to separate the use cases into 2 different workspaces, as mentioned here:

https://docs.microsoft.com/en-us/azure/cloud-adoption-framework/scenarios/data-management/best-pract...

The "analytical" workspace still have the issue that users "could" drop tables since only credential passthrough can be used and no security is on the hive. However, at least it doesn't interfere with job processes.

Also, being in 2 different workspaces, we now required a job (launch by an Azure Devops Pipeline) to create those "hive table" in the analytical workspace as well (Obviously, Not ideal). We didn't want to bother with an "external" hive.

We were also told that Unity Catalog could "potentially" fix this (they didn't seem 100% sure). However, there is no release date as far as I know.

Gerhard
New Contributor III

Hi Rick, thanks for your answer. I think it would already help tremendously if a mount (with a service principal) could be restricted to a given cluster. Then you could have "backend" and "frontend" clusters.

So far my solution is to use credential passthrough and accept the Hive metastore as the weak spot. Jobs that depend on those tables could recreate the metadata if the tables get dropped, as sketched below.
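
As a sketch of that repair step (reusing the hypothetical names from above), a job would simply re-register the metadata over the untouched Delta files:

```python
# Re-create the table metadata if an interactive user dropped it; the Delta files
# at the location are untouched, so this is a metadata-only operation.
location = "abfss://processed@mystorageacct.dfs.core.windows.net/sales_direct"
spark.sql("CREATE DATABASE IF NOT EXISTS processed")
spark.sql(f"CREATE TABLE IF NOT EXISTS processed.sales_direct USING DELTA LOCATION '{location}'")
```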

RicksDB
Contributor II

Indeed, there isn't much missing to meet our security requirements. In our case, the Hive metastore is also our weak spot. Fortunately, interactive workloads (credential passthrough) are contained in a different workspace, so there is no impact on our "jobs" if somebody deletes a table by mistake. The engineering/job workspace has an SDLC (dev/staging/prod environments), which mitigates the risks.

It would be great if someone from Databricks could confirm whether Unity Catalog indeed fixes those issues.

Alexander1
New Contributor III

We have similar issues. Just two remarks. First, how can you have managed tables on a mount? I thought these were limited to the Hive metastore's default location. Second, are you aware that you can create a mount with credential passthrough? An AD user with access to the location creates the mount, and other users can only access it (pass through) if their AD identity also has access to that location. The biggest drawback is that access is still logged as being performed by the AD account that originally created the mount, which is unfortunate. Otherwise we are also waiting for Unity Catalog, especially as Databricks SQL access is currently separate from regular access control, so you basically need two access control concepts, which is a pain.
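
For anyone reading along, a passthrough mount is created roughly like this (container, storage account and mount point are hypothetical; the cluster must have credential passthrough enabled):

```python
# Mount ADLS Gen2 with credential passthrough: no secrets are stored in the mount,
# each reader's own AAD identity is used to authorise access at read time.
configs = {
    "fs.azure.account.auth.type": "CustomAccessToken",
    "fs.azure.account.custom.token.provider.class":
        spark.conf.get("spark.databricks.passthrough.adls.gen2.tokenProviderClassName"),
}
dbutils.fs.mount(
    source="abfss://userdata@mystorageacct.dfs.core.windows.net/",
    mount_point="/mnt/userdata",
    extra_configs=configs,
)
```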

Hubert-Dudek
Esteemed Contributor III

"How can you have managed tables in a mount?" both database and table can have set location. Additionally what is in hive metastore is in database - usually Aws Rds Mysql managed by databricks (location can be previewed in logs). You can also set own hive metastore in databricks in sql database (maria, postgres, ms-sql) and than can put there whatever you want 🙂

But back to the question: it is a big problem and Databricks is aware of it, as we can see in this video (recommended to watch): https://www.youtube.com/watch?v=aRMfxPZxnfc

Anonymous
Not applicable

Hey there @Gerhard Faller

Hope you are doing well!

Just wanted to check in: were you able to resolve your issue, or do you need more help? We'd love to hear from you.

Thanks!

Gerhard
New Contributor III

Hey @Vartika Nain,

we are still in the same situation as described above. The Hive metastore remains the weak point.

I would love to have the ability to dedicate a mount to a given cluster.

Regards, Gerhard
