10-07-2022 02:53 AM
Hi!
So I've been looking into trying Unity Catalog since it seems to add many great features.
But one thing I can't get my head around is the fact that we can't (shouldn't?) use multiple metastores in the same region in UC.
Let me explain my use case:
We have two environments, development and production, with one Databricks workspace (dbw) each.
We are using the medallion architecture, so our data is organized like:
bronze.source_system.dataset2
bronze.source_system.dataset1
Now what I want to do is to use this naming convention for all tables in UC, but that's not possible because tables stored in dev and prod will collide. And the solution of adding a prefix / suffix somewhere in the table name is not very elegant imho.
We could do something like:
prod_bronze.source_system.dataset2
prod_bronze.source_system.dataset1
or
prod.bronze_source_system.dataset2
prod.bronze_source_system.dataset1
But then our codebase needs to keep track of which environment the code is being executed in, so that it can select the correct table in our pipeline tasks.
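To make that concrete, here is a minimal sketch of what the environment tracking could look like with the `prod_bronze.source_system.dataset1` style of naming. The `PIPELINE_ENV` variable, the `VALID_ENVS` tuple and the `table_name` helper are all hypothetical names made up for this illustration, not anything UC provides:

```python
import os

# Hypothetical list of environments; adjust to your own setup.
VALID_ENVS = ("dev", "prod")


def table_name(layer, source_system, dataset, env=None):
    """Build an environment-prefixed name such as 'prod_bronze.source_system.dataset1'.

    If env is not given, it is read from the (assumed) PIPELINE_ENV
    environment variable, falling back to 'dev'.
    """
    env = env or os.environ.get("PIPELINE_ENV", "dev")
    if env not in VALID_ENVS:
        raise ValueError(f"Unknown environment: {env!r}")
    return f"{env}_{layer}.{source_system}.{dataset}"


print(table_name("bronze", "source_system", "dataset1", env="prod"))
```

Every pipeline task would then have to go through a helper like this instead of using the plain `bronze.source_system.dataset1` name, which is exactly the extra bookkeeping I would like to avoid.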
So what I would like to do is use one metastore per environment, which would also mitigate another issue for us: the fact that we have to store all managed tables in the same storage account even if they are created in different environments. That is really not an option for us. Sure, we can use external tables, but that is still not great.
Thankful for any input on this. How does your solution look when using UC in sandbox/dev/prod environments?
Thanks!
10-11-2022 10:57 AM
@Daniel Alteborg
That will come very soon: we will be able to set the storage location at the catalog level (expected 2022 Q4 or 2023 Q1).
Right now we can segregate the data of dev and prod using external tables.
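As a rough sketch of the external-table approach, assuming Delta data already exists at each location and that the metastore has an external location and storage credential covering it: the storage account names in `STORAGE_ROOTS`, the folder layout and the `external_table_ddl` helper below are hypothetical and only illustrate the idea.

```python
# Hypothetical storage roots, one storage account per environment.
STORAGE_ROOTS = {
    "dev": "abfss://lake@devstorageaccount.dfs.core.windows.net",
    "prod": "abfss://lake@prodstorageaccount.dfs.core.windows.net",
}


def external_table_ddl(env, layer, source_system, dataset):
    """Return a CREATE TABLE statement that registers an external Delta table.

    The table name carries the environment prefix, and the LOCATION points
    at that environment's own storage account.
    """
    root = STORAGE_ROOTS[env]  # raises KeyError for an unknown environment
    name = f"{env}_{layer}.{source_system}.{dataset}"
    location = f"{root}/{layer}/{source_system}/{dataset}"
    return f"CREATE TABLE IF NOT EXISTS {name} USING DELTA LOCATION '{location}'"


print(external_table_ddl("dev", "bronze", "source_system", "dataset1"))
```

In a notebook the returned string could be passed to `spark.sql(...)`. The point is only that dev and prod tables live in the same metastore while their data stays in separate storage accounts.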
08-22-2023 11:24 PM
I think the answer to this issue is to have accounts by environment. It would be better if Databricks introduced an Organisations feature as per AWS.
10-07-2022 11:20 AM
@Daniel Alteborg
We limit the number of metastores per region to 1, because there is little utility in allowing more and it creates overhead.
Catalog/schema isolation will help but still doesn’t take away all the complexity wrt namespace restrictions. Just forget the metastore as a construct, start with catalogs and design from there.
01-12-2023 02:19 AM
Hi @Sivaprasad C S, what if we have different instances of ADLS for dev/qa/prod (but in the same region)? We would want to access external table locations present in ADLS. Can we create different metastores for dev/qa/prod in the same region?
01-16-2023 09:07 AM
There is a lot of utility in being able to separate dev/qa/prod data. We do not (and in some cases cannot) have prod data accessible in dev environments/workspaces, or have dev data available in prod environments/workspaces.
As it is at the moment, I do not see how we can use Unity Catalog within our environment. We cannot reproduce the same level of isolation that hive_metastore provided, so there is no direct upgrade path. This is a shame, as there are a lot of other great features that require Unity Catalog that we can't utilize.
I do hope the metastore limit is reassessed.
07-27-2023 04:43 AM
I think we really appreciate you caring about our challenges with complexity and overhead. But the thing is that each org is different and has different needs. And this constraint is a real deal breaker.
Because there is only one metastore per region and AAD tenant, I am not able to use Unity Catalog at my organization. Having the ability to have multiple metastores, each with a different admin, would solve that.
10-10-2022 11:57 PM
@Sivaprasad C S
Hi, ok so we will try to work with the schema.table levels to separate our environments.
But is it possible / will it be possible to use different storage for managed tables in one metastore? Since we will be using the same metastore for dev and prod, it won't be an option for us to use the same storage. Will we then need to use external tables?
//Daniel
04-13-2023 09:26 AM
Any update on this? We also have separate AWS accounts for different environments, and separating the data using catalogs/schemas is not a viable solution for us.
05-24-2023 05:49 AM
Any update? Same problem here. We want to isolate data between environments (= AWS accounts).
07-27-2023 04:30 AM
I very much like the features that come with Unity Catalog. But at the same time I find it extremely challenging to implement in a big organization in its current form, due to the 1-1 relation to the AAD tenant and the one-metastore constraint.
We have one AAD tenant used by multiple business groups that run multiple products. They are from different industries and have little to do with each other. I am an architect on one such product. We have multiple envs with multiple lakes and DB workspaces. Sounds like a good use case for us, right? Well, not so fast.
There are organizational questions that are difficult to answer:
1) Who will be managing the "account"? Our AAD global admins know nothing about Databricks and they don't want to manage this stuff (give permissions, create catalogs etc.). So it has to be delegated, but to whom? It could be me, but that means I will be able to control access to other business groups' catalogs. Will they agree to that? It also means I'll be dealing with their requests all the time. So there has to be some "company wide Databricks admin" nominated who will be managing all this stuff. Getting that done is not easy.
2) Who will be hosting and managing the common metastore storage account and access connector? Since it's for the entire org, it falls into some "common infra / landing zone" bucket, usually managed by some central infra team. So you need to onboard them.
3) What about automation? I'd like to have an SPN that can, for instance, create catalogs and use it for my CI/CD. But for now, there are no granular permissions on the metastore level: either you are an admin or you are not. Having an "admin" SPN that can create and control access to all catalogs in the metastore (which may belong to multiple business groups) is not only close to impossible, it's also stupid.
All these problems come down to one thing: why does this have to be tied to the AAD tenant? Or why can't we have multiple metastores per region, with each product/product group having and managing its own? Then everyone would take care of their own stuff and everyone would be happy!
08-15-2024 03:32 PM
In my view there will be no solution you can expect from Databricks.
The main challenge is the shared handling of
- metadata managed by Databricks in their account
VS
- your data being managed in your cloud storage.
So their (Databricks') design consideration (one UC metastore per region per account) turns out to be perceived as a design constraint by users of UC metastores, as outlined very well by wojciech_jakubo and others.
Unfortunately, the users need to work around this design "constraint".
A hierarchical namespace of catalogs could help, but I did not see this mentioned at the recent Data + AI Summit.