Unity Catalog - great or limited?
08-30-2022 10:39 PM
Hi Guys,
I'd like to know how you have adopted Unity Catalog. To me it seems rather limited, in that you are only allowed to have one UC per region. We are building out a data warehouse and we have three workspaces: dev, test and prod. The naming standard we thought we could use was catalog.sourcesystem_name.table, for instance when staging our source data. As I read the documentation, we would need to add a suffix/prefix somewhere to be able to distinguish which environment we are using, since all data is now visible across workspaces.
I would like to get some input on how you have implemented UC and how you use different workspaces for different purposes.
Br,
Carsten
Labels: Catalog, Unity Catalog
08-31-2022 02:34 AM
Hi Carsten,
In that case, maybe you could create multiple catalogs in the one metastore:
dev.source_system_name_1.table_1
dev.source_system_name_2.table_1
...
test.source_system_name_1.table_1
test.source_system_name_2.table_1
...
prod.source_system_name_1.table_1
prod.source_system_name_2.table_1
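As a rough sketch of that layout, the snippet below builds the three-level names and the matching CREATE CATALOG / CREATE SCHEMA statements as plain strings. The environment and source-system names are just the placeholders from the list above, and the helper names are made up for illustration. It only prints SQL; you would still run the statements yourself in a UC-enabled workspace.

```python
# Sketch only: builds SQL strings, it does not connect to Databricks.
# Environment and source-system names are the placeholders used in this thread.
ENVIRONMENTS = ["dev", "test", "prod"]
SOURCE_SYSTEMS = ["source_system_name_1", "source_system_name_2"]


def qualified_name(env: str, source_system: str, table: str) -> str:
    """Return the three-level name catalog.schema.table."""
    return f"{env}.{source_system}.{table}"


def setup_statements(envs, source_systems):
    """One catalog per environment, one schema per source system."""
    statements = []
    for env in envs:
        statements.append(f"CREATE CATALOG IF NOT EXISTS {env};")
        for source in source_systems:
            statements.append(f"CREATE SCHEMA IF NOT EXISTS {env}.{source};")
    return statements


if __name__ == "__main__":
    for stmt in setup_statements(ENVIRONMENTS, SOURCE_SYSTEMS):
        print(stmt)
```

With this naming, the environment lives in the catalog level, so the schema and table names can stay identical across dev, test and prod.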
There is an interesting video from this year's summit: https://youtu.be/ibvG-pYKl8U?t=852
"that you are only allowed to have one UC per region" <-- I think this is the recommended approach, but you should be able to create multiple metastores in one region. I am thinking about testing this approach: having DEV and PROD in the same region, in different buckets. Another option could be creating DEV in a different region.
I think the limitation with this approach is that you will have to create all the managed tables in only one bucket, the one assigned to the UC.
"since all data now is visible across workspaces." - this can be limited by creating multiple roles.
You can have dev-data-eng, test-data-eng, and prod-data-eng roles created at the account level, then bring only dev-data-eng into your dev workspace, and so on.
The limitation here is that a super-user will still be able to access all the data unless you don't allow the Unity Catalog admin access to the workspace (not sure if this is possible, I am checking this now).
I got a bit confused about this and mixed up workspace account privileges and data privileges.
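On the data-privilege side, the per-environment roles could look roughly like this. It is a sketch that only generates GRANT statements as strings. The group names are the ones mentioned above, while the catalog names dev/test/prod and the helper name are assumptions. The privilege is written as USE CATALOG, which is the name in current documentation; older metastores may call it USAGE, so check against your own setup.

```python
# Sketch only: maps each environment catalog to the one group allowed to use it.
# Group names come from the post above; catalog names are assumed.
ENV_GROUPS = {
    "dev": "dev-data-eng",
    "test": "test-data-eng",
    "prod": "prod-data-eng",
}


def catalog_grants(env_groups):
    """Build one GRANT per catalog; backticks quote the hyphenated group names."""
    return [
        f"GRANT USE CATALOG ON CATALOG {catalog} TO `{group}`;"
        for catalog, group in env_groups.items()
    ]


if __name__ == "__main__":
    for stmt in catalog_grants(ENV_GROUPS):
        print(stmt)
```

Which groups exist in which workspace is a separate, account-level setting and is not covered by these statements.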
09-04-2022 11:18 PM
Thanks for sharing the presentation. We have used it as input on how to design our UC setup. One other limitation is that if we want to use the recommended approach of managed tables, we are forced to use one storage account for all three environments. We cannot accept this and are therefore forced to go with external tables.
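To illustrate the external-table route, here is a minimal sketch that points each environment at its own storage location. Every storage account, container and path below is hypothetical, as are the helper name and the example column. The path would also have to be covered by an external location that the metastore is allowed to access.

```python
# Sketch only: all storage accounts, containers and paths are hypothetical.
ENV_STORAGE = {
    "dev": "abfss://data@devstorageaccount.dfs.core.windows.net",
    "test": "abfss://data@teststorageaccount.dfs.core.windows.net",
    "prod": "abfss://data@prodstorageaccount.dfs.core.windows.net",
}


def external_table_ddl(env: str, schema: str, table: str, columns: str) -> str:
    """Build a CREATE TABLE statement whose data lives in the env's own storage."""
    location = f"{ENV_STORAGE[env]}/{schema}/{table}"
    return (
        f"CREATE TABLE IF NOT EXISTS {env}.{schema}.{table} ({columns}) "
        f"USING DELTA LOCATION '{location}';"
    )


if __name__ == "__main__":
    print(external_table_ddl("dev", "source_system_name_1", "table_1", "id INT"))
```

The trade-off is that the LOCATION clause has to be managed in your own code for every table, which managed tables would otherwise handle for you.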
09-05-2022 12:37 AM
Hi @Carsten Klausman,
this is a problem for me as well; I am not able to store all the data in one bucket.
I will have to use both EXTERNAL and MANAGED tables.
Being able to see all the data from each workspace is a bit of a pain for me as well. I want to isolate some workspaces so that they can read only specific data, so I might need to go with both UC-enabled and non-UC workspaces.
09-13-2022 01:52 AM
I am looking into Unity too. I'd say it is both great and limited, but more limited than great.
Great because you have some very interesting features like column/row-based access, and lineage.
But it is still very limited because of the heavy focus on tables and Delta Lake.
They kinda seem to have forgotten that A LOT of data still resides in plain parquet files.
For example:
Overwrite mode for DataFrame write operations into Unity Catalog is supported only for Delta tables, not for other file formats.
This alone makes me wonder if we should use it, or look into DataHub/Amundsen.
It is still a new product, so new features might get added, but right now I probably will not be using it.
12-24-2022 07:57 PM
@Pat Sienkiewicz and @Carsten Klausman, I am implementing Unity Catalog and indeed I had to use a prefix to separate the dev, test and prod catalogs in one Unity metastore, because all my workspaces are in one Azure region so I can only create one metastore. This is not a limitation though, as our team just has to follow this new standard for naming catalogs. On the prod catalog, the team then has read-only permission. The biggest limitation so far (and @Kaniz Fatma, I have submitted a feature request to Databricks) is not being able to differentiate between DML (insert/update/delete) and DDL permissions on a schema or table. Granting MODIFY means the user can change both the table schema and the table data. Ideally MODIFY should be limited to DDL permission AND there should be three separate permissions (insert/update/delete) like in a relational DB, but I understand it's not easy to implement this on parquet files. I am optimistic that Databricks will bring this feature soon, @Vivian Wilfred.
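The read-only setup on prod might be sketched like this. The group name data-eng-team, the schema name and the privilege sets per environment are assumptions for illustration, not the actual configuration from this post. In practice the group would also need usage rights on the catalog and schema before these grants have any effect.

```python
# Sketch only: group name, schema name and privilege sets are assumptions.
PRIVILEGES_BY_ENV = {
    "dev": ["SELECT", "MODIFY"],
    "test": ["SELECT", "MODIFY"],
    "prod": ["SELECT"],  # read-only for the team
}


def schema_grant(env: str, schema: str, group: str) -> str:
    """Build a schema-level GRANT using the privilege set for that environment."""
    privileges = ", ".join(PRIVILEGES_BY_ENV[env])
    return f"GRANT {privileges} ON SCHEMA {env}.{schema} TO `{group}`;"


if __name__ == "__main__":
    for env in PRIVILEGES_BY_ENV:
        print(schema_grant(env, "source_system_name_1", "data-eng-team"))
```

Note that the privilege list has a single MODIFY entry; there is nothing finer-grained to put there, which is the gap described above.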
10-28-2022 12:15 AM
Hi @Carsten Klausman,
How is your journey with the Unity Catalog?
I am seeing more and more limitations recently. It's a pity; we had this freedom to work with Spark, ACID and optimized.
We are being forced to do things with Unity Catalog, and seasoned Data Engineers have to talk to support to learn that some things don't work.
10-31-2022 12:38 AM
Hi @Pat Sienkiewicz,
It is not a smooth ride. The whole permission model is not optimal. We had to add SQL statements that alter the permissions into our code. Optimized? Do you mean the managed tables? We didn't go that route due to the limitation of using only one storage account, and we want to separate our environments.
11-01-2022 01:30 AM
Hi @Carsten Klausman,
the single storage account is a killer for some of our current use cases, but we are still going with Unity Catalog and managed tables for the other use case, and looking forward to the workspace separation feature (I think it's called that).
Optimized - I meant all the extra things we had on top of vanilla Spark, e.g. delta caching (now called disk caching), but yes, managed tables too.
I still see some limitations, but I think it might be hard to maintain this platform in the open way it was, so I also see that I need to adapt to those changes, even though sometimes it's not obvious.
A good example: I decided to set up DEV and PROD Unity Catalogs to be able to test things, for example new features, before going to the PROD env. At the beginning one UC per region was the recommended way, but I was still going to have two in one region. Then one day I noticed it's not possible to create two UCs in one AWS region for one Databricks account, so you have to adapt your code and your design.