Databricks

Carsten_K · ‎08-30-2022

Hi Guys,

I'd like to know how you have adopted Unity Catalog. To me it seems rather limited in that you are only allowed to have one UC per region. We are building out a data warehouse and we have three workspaces: dev, test and prod. The naming standard we thought we could use was catalog.sourcesystem_name.table when, for instance, staging our source data. As I read the documentation, we would need to add a suffix/prefix somewhere to be able to distinguish which environment we are using, since all data is now visible across workspaces.

I would like to get some input on how you have implemented UC and how you are using different workspaces for different purposes.

Br,

Carsten

Pat · ‎08-31-2022

Hi Carsten,

in that case, maybe you could create multiple catalogs in the one metastore:

dev.source_system_name_1.table_1

dev.source_system_name_2.table_1

...

test.source_system_name_1.table_1

test.source_system_name_2.table_1

...

prod.source_system_name_1.table_1

prod.source_system_name_2.table_1
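
A minimal sketch of the naming scheme above, assuming the environment name is used directly as the catalog name. The helper `qualified_name` and the `ENVIRONMENTS` tuple are hypothetical, made up for illustration, and are not part of any Databricks API:

```python
# Hypothetical set of environments, one catalog per environment.
ENVIRONMENTS = ("dev", "test", "prod")


def qualified_name(env: str, source_system: str, table: str) -> str:
    """Build a three-level name: <env catalog>.<source system schema>.<table>."""
    if env not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {env!r}")
    return f"{env}.{source_system}.{table}"


# Same schema and table name in every environment, only the catalog differs.
for env in ENVIRONMENTS:
    print(qualified_name(env, "source_system_name_1", "table_1"))
```

With this layout, job code can take the environment as a single parameter and leave schema and table names identical across dev, test and prod.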

There is an interesting video from this year's summit: https://youtu.be/ibvG-pYKl8U?t=852

"that you are only allowed to have one UC per region" <-- I think this is the recommended approach, but you should be able to create multiple metastores in one region. I am thinking about testing this approach: having DEV and PROD in the same region, in different buckets. Another option could be creating DEV in a different region.

I think the limitation with this approach is that you will have to create all the managed tables in only one bucket, the one assigned to the UC.

~~"since all data now is visible across workspaces." - this can be limited by creating multiple roles.~~

~~You can have dev-data-eng, test-data-eng, and prod-data-eng roles created at the account level then you bring only dev-data-eng into your dev workspace, and so on.~~

The limitation here is that a super-user will still be able to access all the data unless you don't allow the Unity Catalog admin access to the workspace (not sure if this is possible, I am checking this now).

I got a bit confused about this and mixed up workspace account privileges and data privileges.

Carsten_K · ‎09-04-2022

Thanks for sharing the presentation. We have used it as input on how to design our UC setup. One other limitation is that if we want to use the recommended approach of managed tables, we are forced to use one storage account for all three environments. This we cannot accept, so we are forced to go with external tables.
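
As a rough sketch of the external-table route, assuming one storage account per environment: the snippet below only builds the `CREATE TABLE ... LOCATION` statement as a string. The storage account names, container name, and the helper `external_table_ddl` are all hypothetical, and in Unity Catalog the path would additionally need to be covered by an external location you have access to.

```python
# Hypothetical storage roots: a separate storage account for each environment.
STORAGE_ROOTS = {
    "dev": "abfss://data@devaccount.dfs.core.windows.net/staging",
    "test": "abfss://data@testaccount.dfs.core.windows.net/staging",
    "prod": "abfss://data@prodaccount.dfs.core.windows.net/staging",
}


def external_table_ddl(env: str, source_system: str, table: str, columns: dict) -> str:
    """Return a CREATE TABLE statement whose data sits under the environment's own storage root."""
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns.items())
    location = f"{STORAGE_ROOTS[env]}/{source_system}/{table}"
    return (
        f"CREATE TABLE IF NOT EXISTS {env}.{source_system}.{table} ({cols}) "
        f"USING DELTA LOCATION '{location}'"
    )


ddl = external_table_ddl(
    "dev", "source_system_name_1", "table_1", {"id": "BIGINT", "name": "STRING"}
)
print(ddl)  # on a cluster this string could be passed to spark.sql(ddl)
```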

Pat · ‎09-05-2022

Hi @Carsten Klausman ,

this is a problem for me as well, I am not able to store all the data in one bucket.

I will have to use both EXTERNAL and MANAGED tables.

Being able to see all the data from each workspace is a bit of a pain for me as well. I want to isolate some workspaces so they can read only specific data, so I might need to go with both UC-enabled and non-UC workspaces.

-werners- · ‎09-13-2022

I am looking into Unity too. I'd say it is both great and limited, but more limited than great.

Great because you have some very interesting features like column/row-based access, and lineage.

But it is still very limited because of the heavy focus on tables and Delta Lake.

They kinda seem to have forgotten that A LOT of data still resides in plain Parquet files.

For example:

Overwrite mode for DataFrame write operations into Unity Catalog is supported only for Delta tables, not for other file formats.

This alone makes me wonder if we should use it, or look into DataHub/Amundsen.

It is still a new product, so new features might get added, but right now I probably will not be using it.

prasadvaze · ‎12-24-2022

@Pat Sienkiewicz and @Carsten Klausman I am implementing Unity Catalog and indeed I had to use a prefix to separate the dev, test and prod catalogs in one Unity metastore, because all my workspaces are in one Azure region so I can only create one metastore. This is not a limitation though, as our team just has to follow this new standard for naming catalogs. On the prod catalog, the team has read-only permission. The biggest limitation so far (and @Kaniz Fatma, I have submitted a feature request to Databricks) is not being able to differentiate between DML (insert/update/delete) and DDL permissions on a schema or table. Granting modify means a user can change both the table schema and the table data. Ideally modify should be limited to DDL permission AND there should be three separate permissions (insert/update/delete) like in a relational DB, but I understand it's not easy to implement this on Parquet files. I am optimistic that Databricks will bring this feature soon @Vivian Wilfred
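
A hedged sketch of what "read-only on the prod catalog" could look like as generated SQL. The helper `read_only_grants` and the group name `data-eng` are hypothetical; the privilege names (`USE CATALOG`, `USE SCHEMA`, `SELECT`) are the ones used in the current Unity Catalog privilege model, and older releases used `USAGE` instead, so check them against your runtime.

```python
def read_only_grants(catalog: str, principal: str) -> list:
    """Return GRANT statements giving a principal read-only access to a whole catalog."""
    privileges = ("USE CATALOG", "USE SCHEMA", "SELECT")
    # Granted at catalog level on the assumption that privileges are
    # inherited by the schemas and tables underneath it.
    return [
        f"GRANT {privilege} ON CATALOG {catalog} TO `{principal}`"
        for privilege in privileges
    ]


for statement in read_only_grants("prod", "data-eng"):
    print(statement)  # on a cluster each statement could be run with spark.sql(statement)
```

Note that MODIFY is simply left out here; as described above, there is no finer-grained split between insert, update and delete to grant instead.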

Kaniz · ‎09-02-2022

Hi @Carsten Klausman , We haven't heard from you on the last response from @Pat Sienkiewicz, and I was checking back to see if his suggestions helped you.

Otherwise, if you have found a solution, please share it with the community, as it can be helpful to others.

Also, please don't forget to click on the "Select As Best" button whenever the information provided helps resolve your question.

Pat · ‎10-28-2022

Hi @Carsten Klausman ,

How is your journey with the Unity Catalog?

I am seeing more and more limitations recently. It's a pity, we had this freedom to work with Spark, ACID and optimized.

We are being forced to do things with the Unity Catalog, and seasoned Data Engineers have to talk to support to learn that some things don't work.

Carsten_K · ‎10-31-2022

Hi @Pat Sienkiewicz

It is not a smooth ride. The whole permission model is not optimal. We had to add SQL statements that alter the permissions into our code. Optimized? Do you mean the managed tables? We didn't go that route due to the limitation of using only one storage account, and we want to separate our environments.

Pat · ‎11-01-2022

Hi @Carsten Klausman ,

the one storage account is a killer for some of our current use cases, but we are still going with Unity Catalog and managed tables for the other use case, and looking forward to the workspace separation feature (I think it's called that).

Optimized - I meant all the extra things we had on top of vanilla Spark, e.g. delta caching (now called disk caching), but yes, managed tables too.

I still see some limitations, but I think it might be hard to maintain this platform in the open way it was, so I also see that I need to adapt to those changes, though sometimes it's not obvious.

A good example: I decided to set up DEV and PROD Unity Catalogs to be able to test things, for example new features, before going to the PROD env. At the beginning, 1 UC per region was the recommended way, but I was still going to have 2 in 1 region. Then one day I noticed it's not possible to create 2 UCs in 1 AWS region for 1 Databricks account, and then you have to adapt your code and your design.


Unity Catalog - great or limited?
