
Unity Catalog - multiple metastore in same region

Marra
New Contributor III

Hi!

So I've been looking into trying Unity Catalog since it seems to add many great features.

But one thing I can't get my head around is the fact that we can't (shouldn't?) use multiple metastores in the same region in UC.

Let me explain my use case:

We have two environments, development and production, with one Databricks workspace (dbw) each.

We are using the medallion architecture, so our data is organized like:

bronze.source_system.dataset2

bronze.source_system.dataset1

Now what I want to do is use this naming convention for all tables in UC, but that's not possible because tables stored in dev and prod would collide. And the solution of adding a prefix or suffix somewhere in the table name is not very elegant, imho.

We could do something like:

prod_bronze.source_system.dataset2

prod_bronze.source_system.dataset1

or

prod.bronze_source_system.dataset2

prod.bronze_source_system.dataset1

But then we need our codebase to keep track of which environment the code is being executed in, so that it selects the correct table in our pipeline tasks.
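To illustrate what that tracking might look like, here is a minimal sketch of the second naming scheme above. It assumes the environment is passed in through a hypothetical `DEPLOY_ENV` environment variable; the variable name, the `table_name` helper, and the set of valid environments are illustrative only and not part of any Databricks API.

```python
import os
from typing import Optional

# Hypothetical set of environments; adjust to your own setup.
VALID_ENVS = {"dev", "prod"}


def table_name(layer: str, source_system: str, dataset: str,
               env: Optional[str] = None) -> str:
    """Build a three-level name of the form <env>.<layer>_<source_system>.<dataset>."""
    # Fall back to the (assumed) DEPLOY_ENV variable, defaulting to dev.
    env = env or os.environ.get("DEPLOY_ENV", "dev")
    if env not in VALID_ENVS:
        raise ValueError(f"unknown environment: {env!r}")
    return f"{env}.{layer}_{source_system}.{dataset}"


print(table_name("bronze", "source_system", "dataset1", env="prod"))
# prints: prod.bronze_source_system.dataset1
```

Every pipeline task would then call the helper instead of hard-coding a table name, which is exactly the extra bookkeeping I would like to avoid.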

So what I would like to do is use one metastore per environment, which would also mitigate another issue for us: the fact that we have to store all managed tables in the same storage account even if they are created in different environments. That is really not an option for us. Sure, we can use external tables, but that is still not great.

Thankful for any input on this. How does your solution look when using UC in sandbox/dev/prod environments?

Thanks!

2 ACCEPTED SOLUTIONS

Accepted Solutions

Sivaprasad1
Valued Contributor II

@Daniel Alteborg

That will come very soon: we will be able to set the storage location at the catalog level (expected in Q4 2022 or Q1 2023).

Right now we can segregate the data of Dev and Prod using external tables.


AdamMcGuinness
New Contributor III

I think the answer to this issue is to have separate accounts per environment. It would be better if Databricks introduced an Organizations feature like the one in AWS.


13 REPLIES

Sivaprasad1
Valued Contributor II

@Daniel Alteborg

We limit the number of metastores per region to one, because there is little utility in allowing more and it creates overhead.

Catalog/schema isolation will help, but it still doesn't take away all the complexity with respect to namespace restrictions. Just forget the metastore as a construct, start with catalogs, and design from there.

Hi @Sivaprasad C S, what if we have different instances of ADLS for dev/qa/prod (but in the same region)? We would want to access external table locations present in ADLS. Can we create different metastores for Dev/QA/Prod in the same region?

There is a lot of utility in being able to separate dev/qa/prod data. We do not (and in some cases cannot) have prod data accessible in dev environments/workspaces, or have dev data available in prod environments/workspaces.

As it is at the moment, I do not see how we can use Unity Catalog within our environment. We cannot reproduce the same level of isolation that hive_metastore provided, so there is no direct upgrade path. This is a shame, as there are a lot of other great features that require Unity Catalog that we can't utilize.

I do hope the metastore limit is reassessed.

I think we really appreciate you caring about our challenges with complexity and overhead. But the thing is that each org is different and has different needs. And this constraint is a real deal breaker. 

Because there is only one metastore per region and AAD tenant, I am not able to use Unity Catalog at my organization. Having the ability to have multiple metastores, each with a different admin, would solve that.

Kaniz_Fatma
Community Manager

Hi @Daniel Alteborg, we haven't heard from you on the last response from @Sivaprasad C S, and I was checking back to see if you have a resolution yet.

If you have any solution, please share it with the community as it can be helpful to others. Otherwise, we will respond with more details and try to help.

Also, please don't forget to click on the "Select As Best" button whenever the information provided helps resolve your question.

Marra
New Contributor III

@Sivaprasad C S

Hi, OK, so we will try to work with the schema.table levels to separate our environments.

But is it possible (or will it be possible) to use different storage for managed tables in one metastore? Since we will be using the same metastore for dev and prod, using the same storage won't be an option for us. Will we then need to use external tables?

//Daniel

Sivaprasad1
Valued Contributor II

@Daniel Alteborg

That will come very soon: we will be able to set the storage location at the catalog level (expected in Q4 2022 or Q1 2023).

Right now we can segregate the data of Dev and Prod using external tables.
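As a rough sketch of the external-table approach, the helper below builds a `CREATE TABLE ... LOCATION` statement that points each environment at its own storage account. The storage account names, the container name, the folder layout, and the `external_table_ddl` helper are all hypothetical. It also assumes that Delta data already exists at the target path and that the path is covered by an external location you are allowed to use.

```python
# Hypothetical storage roots, one ADLS account per environment.
STORAGE_ROOTS = {
    "dev": "abfss://lake@devstorageaccount.dfs.core.windows.net",
    "prod": "abfss://lake@prodstorageaccount.dfs.core.windows.net",
}


def external_table_ddl(env: str, layer: str, source_system: str, dataset: str) -> str:
    """Return a DDL statement registering an external Delta table for one environment."""
    root = STORAGE_ROOTS[env]  # raises KeyError for an unknown environment
    # Assumed folder layout: <root>/<layer>/<source_system>/<dataset>
    location = f"{root}/{layer}/{source_system}/{dataset}"
    name = f"{env}.{layer}_{source_system}.{dataset}"
    return f"CREATE TABLE IF NOT EXISTS {name} USING DELTA LOCATION '{location}'"


print(external_table_ddl("dev", "bronze", "source_system", "dataset1"))
```

In a notebook, the returned string could be passed to `spark.sql(...)`. The metadata still lives in the shared metastore, but the data files for dev and prod stay in separate storage accounts.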

171499
New Contributor III

Any update on this? We also have separate AWS accounts for different environments, and separating the data using catalogs/schemas is not a viable solution for us.

alemo
New Contributor III

Any update? Same problem here. We want to isolate data between environments (= AWS accounts).

Kaniz_Fatma
Community Manager

Hi @Daniel Alteborg, we haven't heard from you since the last response from @Sivaprasad C S, and I was checking back to see if you have a resolution yet.

If you have any solution, please share it with the community as it can be helpful to others. Otherwise, we will respond with more details and try to help.

Also, please don't forget to click on the "Select As Best" button whenever the information provided helps resolve your question.

wojciech_jakubo
New Contributor III

I very much like the features that come with Unity Catalog. But at the same time I find it extremely challenging to implement in a big organization in its current form, due to the 1-1 relation to the AAD tenant and the one-metastore constraint.

We have one AAD tenant used by multiple business groups that run multiple products. They are from different industries and have little to do with each other. I am an architect on one such product. We have multiple envs with multiple lakes and Databricks workspaces. Sounds like a good use case for us, right? Well, not so fast.

There are organizational questions that are difficult to answer:

1) Who will be managing the "account"? Our AAD global admins know nothing about Databricks and they don't want to manage this stuff (give permissions, create catalogs etc.). So it has to be delegated, but to whom? It could be me, but that means I will be able to control access to other business groups' catalogs. Will they agree to that? It also means I'll be dealing with their requests all the time. So there has to be some "company wide Databricks admin" nominated who will be managing all this stuff. Getting that done is not easy.

2) Who will be hosting and managing the common metastore storage account and access connector? Since it's for the entire org, it falls into some "common infra / landing zone" bucket, usually managed by some central infra team. So you need to onboard them.

3) What about automation? I'd like to have an SPN that can, for instance, create catalogs, and use it for my CI/CD. But for now, there are no granular permissions on the metastore level: either you are an admin or you are not. Having an "admin" SPN that can create and control access to all catalogs in the metastore (which may belong to multiple business groups) is not only close to impossible, it's also stupid.

All these problems come down to one thing - why does this have to be tied to AAD tenant? Or why can't we have multiple metastores per region - each product/product group having and managing its own? Then everyone would take care of their own stuff and everyone would be happy!

In my view there will be no solution you can expect from Databricks.

The main challenge is the shared handling of 

- metadata managed by Databricks in their account
VS
- your data being managed in your cloud storage.

So their (Databricks) design consideration (one UC metastore per region per account) turns out to be perceived as a design constraint by users of UC metastores, as outlined very well by wojciech_jakubo and others.

Unfortunately, the users need to work around this design "constraint".
A hierarchical namespace of catalogs could help, but I did not see this being mentioned at the recent Data + AI Summit.

AdamMcGuinness
New Contributor III

I think the answer to this issue is to have separate accounts per environment. It would be better if Databricks introduced an Organizations feature like the one in AWS.
