
Metastore - One per Account/Region Limitation

AdamMcGuinness
New Contributor III

I have been looking at Databricks’ suggested use of catalogs, and my instinct is that a separate metastore for each SDLC environment (dev, test, prod) is preferable. Given the current constraint of one metastore per account per region, following this pattern would require a separate account for each environment, as we would not want one account's environments spread across different regions. This approach yields the full benefit of the three-level namespace, because you are not giving up the top level to an environment as this "best practice" suggests:
https://learn.microsoft.com/en-us/azure/databricks/data-governance/unity-catalog/best-practices#--or...

My rationale:

  • By dedicating a catalog to an environment, you do not get the full benefit of the three-level namespace e.g. for a source dataset:

    Catalog - bronze_systemA (One catalog dedicated to a source in the environment’s metastore)
              Schema – raw
                        Schema child objects
              Schema – historised (optional if you need to collect time series data from source)
                        Schema child objects
              Schema – curated (optional curation of source data without aggregating to other sources as you would in silver.)
                        Schema child objects

    Is better than:

    Catalog - bronze_all_systems_dev (one catalog dedicated to all sources by environment in same metastore)
              Schema - systemA_raw
                        Schema child objects
              Schema - systemA_historised
                        Schema child objects
              Schema - systemA_curated
                        Schema child objects
              Many more schemas
                        Schema child objects

  • On-platform deployments from lower to higher environments would not have to manage a change in catalog name wherever an object is referenced, e.g. in a view’s SQL definition:
    …..
    FROM bronze_systemA.raw.table_abc

    Is better than:
    …..
    FROM bronze_all_systems_dev.systemA_raw.table_abc

    When deploying to higher environments, the “_dev” suffix needs to change.

    I anticipate this may also apply to other objects such as:
       Workflows
       DLT
       Jobs
       Maybe more ...

  • A connection from an external tool would only need a different connection string for the higher environment, not a different catalog name.

  • Binding of catalogs to workspaces provides a clean method to manage data access compared to cherry picking schemas into ACLs and associating with authorised users.
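
To make the naming difference in the points above concrete, here is a minimal sketch in Python. It only builds strings and calls no Databricks API. The function names and the `test` and `prod` environment names are assumptions for illustration; only the `_dev` example comes from the layouts above.

```python
# Illustrative only: plain string building, no Databricks API involved.
ENVIRONMENTS = ["dev", "test", "prod"]  # assumed environment names


def name_with_metastore_per_env(source: str, layer: str, table: str) -> str:
    """Catalog per source; the environment is implied by the metastore."""
    return f"bronze_{source}.{layer}.{table}"


def name_with_catalog_per_env(env: str, source: str, layer: str, table: str) -> str:
    """Catalog per environment; the source moves into the schema name."""
    return f"bronze_all_systems_{env}.{source}_{layer}.{table}"


for env in ENVIRONMENTS:
    # The first name is identical in every environment; the second is not.
    print(
        env,
        name_with_metastore_per_env("systemA", "raw", "table_abc"),
        name_with_catalog_per_env(env, "systemA", "raw", "table_abc"),
    )
```

With a metastore per environment the reference stays the same across dev, test and prod, so nothing in the object definition has to be rewritten at deployment time.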

Interested to hear if I have missed something, and in other points of view.

Thanks

4 REPLIES

-werners-
Esteemed Contributor III

Yes I am aware of that. I'm not convinced this is a "best practice".

It means that if you stay in the same metastore and use catalogs to divide up your environments as Databricks shows, you have to deal with a changing three-level namespace. You really only get a two-level namespace, as you have given the top level away to an environment.

My main concern is deploying objects from lower to higher environments when they have to deal with the changing namespace, not only on platform but for external tools as well.

I am wondering how others are dealing with that?

-werners-
Esteemed Contributor III

I understand your concern.
However, changing the catalog name while deploying can be handled by putting the 'environment' in external config and updating that while deploying.
If you want strictly separated envs, having one catalog per env is an option but I am not sure if that is even possible using Unity for the moment.  AFAIK you can only have one metastore per region.
Perhaps that will change in the future.
So for the moment you are stuck with workspace-catalog binding and using a variable env name.
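
A minimal sketch of the external-config approach, assuming the environment name arrives through an environment variable. `DEPLOY_ENV`, `render_sql`, the allowed environment names and the view name `v_table_abc` are all hypothetical; only the `bronze_all_systems_<env>.systemA_raw.table_abc` reference comes from the original post.

```python
import os
from string import Template

# Hypothetical view definition; ${env} is the only part that varies per environment.
VIEW_SQL = Template(
    "CREATE OR REPLACE VIEW bronze_all_systems_${env}.systemA_curated.v_table_abc AS\n"
    "SELECT * FROM bronze_all_systems_${env}.systemA_raw.table_abc"
)

ALLOWED_ENVS = {"dev", "test", "prod"}  # assumed environment names


def render_sql(template: Template, env: str) -> str:
    """Substitute the environment name, rejecting unexpected values."""
    if env not in ALLOWED_ENVS:
        raise ValueError(f"unknown environment: {env!r}")
    return template.substitute(env=env)


if __name__ == "__main__":
    # DEPLOY_ENV is a hypothetical variable set by the deployment pipeline.
    print(render_sql(VIEW_SQL, os.environ.get("DEPLOY_ENV", "dev")))
```

The same definition is kept in source control once, and the deployment step decides which catalog it resolves to. Validating the value against a fixed set avoids deploying to a mistyped catalog name.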

SSundaram
Contributor

You can create multiple metastores for each region within an account. This is not a hard constraint: reach out to your account team and they can make an exception. Before doing that, consider what kind of securable sharing you will need between dev, test and prod (on different metastores). Some data science use cases will have different sharing needs than data engineering use cases.
