Metastore - One per Account/Region Limitation
08-22-2023 11:47 PM
I have been looking at Databricks’ suggested use of catalogs, and my instinct is that a separate metastore for each SDLC environment (dev, test, prod) is preferable. Given the current constraint of one metastore per account and region, following this pattern would require a separate account for each environment, since we would not want the environments of a single account spread across different regions. This approach yields the full benefit of the three-level namespace, because you do not give up the top level to an environment as this "best practice" suggests:
https://learn.microsoft.com/en-us/azure/databricks/data-governance/unity-catalog/best-practices#--or...
My rationale:
- If you dedicate a catalog to an environment, you do not get the full benefit of the three-level namespace. For example, for a source dataset:
Catalog - bronze_systemA (one catalog dedicated to a source in the environment’s metastore)
    Schema – raw
        Schema child objects
    Schema – historised (optional, if you need to collect time series data from the source)
        Schema child objects
    Schema – curated (optional curation of source data, without aggregating with other sources as you would in silver)
        Schema child objects
Is better than:
Catalog - bronze_all_systems_dev (one catalog dedicated to all sources, by environment, in the same metastore)
    Schema - systemA_raw
        Schema child objects
    Schema - systemA_historised
        Schema child objects
    Schema - systemA_curated
        Schema child objects
    Many more schemas ...
- On-platform deployments from lower to higher environments would not have to manage a change in catalog name wherever an object is referenced, e.g. in a view’s SQL definition:
…..
FROM bronze_systemA.raw.table_abc
Is better than:
…..
FROM bronze_all_systems_dev.systemA_raw.table_abc
When deploying to higher environments “_dev” needs to change.
I anticipate this may also apply to other objects such as:
Workflows
DLT
Jobs
Maybe more ...
- An external connection in an external tool will only have to change the connection string for the higher environment, not the catalog name.
- Binding of catalogs to workspaces provides a clean method to manage data access compared to cherry picking schemas into ACLs and associating with authorised users.
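To make the deployment point concrete, here is a rough Python sketch of the renaming step that the shared-metastore layout would need on every promotion. The helper name `retarget_catalog` and the convention of a trailing `_dev` / `_prod` suffix on the catalog are my own assumptions for illustration, not anything Databricks provides.

```python
import re


def retarget_catalog(sql: str, source_env: str, target_env: str) -> str:
    """Swap the environment suffix on catalog names in three-part references.

    Hypothetical helper: assumes catalogs are named <something>_<env> and are
    always referenced as catalog.schema.object.
    """
    # Match an identifier ending in _<source_env> only when it is followed by
    # ".schema.object", i.e. when it sits in the catalog position.
    pattern = re.compile(rf"\b(\w+)_{re.escape(source_env)}(?=\.\w+\.\w+)")
    return pattern.sub(rf"\g<1>_{target_env}", sql)


# Shared metastore: the reference has to be rewritten for each environment.
shared = "FROM bronze_all_systems_dev.systemA_raw.table_abc"
promoted = retarget_catalog(shared, "dev", "prod")

# Metastore per environment: there is no suffix, so nothing changes.
separate = "FROM bronze_systemA.raw.table_abc"
unchanged = retarget_catalog(separate, "dev", "prod")
```

With a metastore per environment the second call is a no-op, which is the whole point: the definition can be deployed as-is.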
Interested if I have missed something and other points of view.
Thanks
08-23-2023 01:16 AM
So basically Databricks advises one metastore for multiple envs.
08-23-2023 03:46 PM
Yes I am aware of that. I'm not convinced this is a "best practice".
It means that if you stay in the same metastore and use catalogs to divide up your environments as Databricks shows, you have to deal with a changing three-level namespace. You really only get a two-level namespace, as you have given the top level away to an environment.
My main concern is deploying objects from lower to higher environments when they have to cope with the changing namespace, not only on platform but for external tools as well.
I am wondering how others are dealing with that?
08-23-2023 11:55 PM
I understand your concern.
However, changing the catalog name while deploying can be handled by putting the 'environment' in external config and updating that while deploying.
If you want strictly separated envs, having one metastore per env is an option, but I am not sure that is even possible with Unity Catalog at the moment. AFAIK you can only have one metastore per region.
Perhaps that will change in the future.
So for the moment you are stuck with workspace-catalog binding and using a variable env name.
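A minimal sketch of the external-config idea, assuming the SQL definitions are kept as templates and rendered at deploy time. The `ENV_CONFIG` dictionary, the `render_sql` helper and the `silver_${env}.sales.v_abc` view name are hypothetical; only the `bronze_all_systems_<env>.systemA_raw.table_abc` reference comes from the example earlier in the thread.

```python
from string import Template

# Hypothetical per-environment settings. In practice these could come from a
# config file or from deployment pipeline variables.
ENV_CONFIG = {
    "dev": {"env": "dev"},
    "test": {"env": "test"},
    "prod": {"env": "prod"},
}

# The catalog suffix is a placeholder instead of a hard-coded "_dev".
VIEW_SQL = Template(
    "CREATE OR REPLACE VIEW silver_${env}.sales.v_abc AS\n"
    "SELECT * FROM bronze_all_systems_${env}.systemA_raw.table_abc"
)


def render_sql(template: Template, environment: str) -> str:
    """Fill in the environment-specific values for one target environment."""
    # substitute() raises KeyError if a placeholder has no value, so a missing
    # setting fails the deployment instead of producing a wrong catalog name.
    return template.substitute(ENV_CONFIG[environment])


prod_sql = render_sql(VIEW_SQL, "prod")
```

The same template is then rendered once per workspace, so the environment name lives in one place rather than in every object definition.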
11-30-2023 03:57 PM
You can create multiple metastores for each region within an account. This is not a hard constraint; reach out to your account team and they can make an exception. Before doing that, consider what kind of securable sharing you will need between dev, test and prod (on different metastores). Some data science use cases will have different sharing needs than data engineering use cases.