โ08-22-2023 11:47 PM
Looking at Databricksโ suggested use of catalogs. My instincts are now leading me to the conclusion having separate metastore for each SDLC environment (dev, test, prod) is preferable. I think if this pattern were followed, this means due to current constraints, a separate account for each environment is required as we would not want to be in different regions for the same account. This approach yields the full benefits of a three-level namespace as you are not giving up the top level to an environment as per this "best practice"
https://learn.microsoft.com/en-us/azure/databricks/data-governance/unity-catalog/best-practices#--or...
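To make the namespace point concrete, here is a minimal illustration (catalog, schema and table names are all made up) of what the fully qualified names look like under each model:

# Separate metastore per environment: the three-level name is identical in dev,
# test and prod, so notebooks and external tools never need to change it.
per_env_metastore_name = "sales.finance.orders"

# Single shared metastore with one catalog per environment: the top level encodes
# the environment, so every reference must be rewritten or parameterised on promotion.
env = "dev"  # becomes "test" or "prod" in higher environments
single_metastore_name = f"{env}_sales.finance.orders"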
My rationale:
โ
โ
โ
Interested if I have missed something and other points of view.
Thanks
โ08-23-2023 01:16 AM
So basically Databricks advises one metastore for multiple envs.
โ08-23-2023 03:46 PM
Yes, I am aware of that. I'm not convinced it is a "best practice", though.
If you stay in the same metastore and use catalogs to divide up your environments as Databricks shows, you have to deal with a changing three-level namespace. You really only get a two-level namespace, as you have given the top level away to an environment.
My main concern is deploying objects from lower to higher environments when everything has to cope with the changing namespace, not only on the platform but in external tools as well.
How are others dealing with that?
โ06-14-2025 02:12 PM
We had the same dilemma and ended up leveraging secrets in a Key Vault-backed secret scope plus dynamic notebooks to determine which "environment" we were in. Based on that, the notebook sets variables to select the appropriate catalog.
import os

# Check that a default secret scope has been configured on the selected cluster
DEFAULT_SECRET_SCOPE = os.environ.get("DEFAULT_SECRET_SCOPE")
if not DEFAULT_SECRET_SCOPE:
    raise Exception("A default secret scope has not been configured on the selected cluster.")

# Retrieve the environment name used to determine which catalog to operate in
ENVIRONMENT = dbutils.secrets.get(scope=DEFAULT_SECRET_SCOPE, key="environment-name")
CATALOG_NAME = f"{ENVIRONMENT}_bronze"
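A natural follow-on, shown here only as a sketch, is to make the resolved catalog the session default so downstream code can stick to two-level schema.table names (the schema name below is just an example):

# Make the resolved catalog the session default; "raw" is an example schema name
spark.sql(f"USE CATALOG {CATALOG_NAME}")
spark.sql("USE SCHEMA raw")
orders = spark.table("orders")  # resolves to <environment>_bronze.raw.orders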
โ08-23-2023 11:55 PM
I understand your concern.
However, the changing catalog name can be handled by putting the 'environment' in external config and updating that value as part of the deployment.
If you want strictly separated environments, having one metastore per environment is an option, but I am not sure that is even possible with Unity Catalog at the moment. AFAIK you can only have one metastore per region.
Perhaps that will change in the future.
So for the moment you are stuck with workspace-catalog binding and using a variable env name.
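For what it's worth, a minimal sketch of the "environment in external config" idea (the config path, key and catalog naming convention are all assumptions, nothing Databricks-specific):

import json

# Hypothetical per-environment config file swapped in by the deployment pipeline,
# e.g. {"environment": "dev"}
with open("/dbfs/config/deployment.json") as f:
    env = json.load(f)["environment"]

# Made-up naming convention: one catalog per environment in a shared metastore
catalog = f"{env}_lakehouse"
spark.sql(f"USE CATALOG {catalog}")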
โ11-30-2023 03:57 PM
You can create multiple metastores per region within an account. This is not a hard constraint; reach out to your account team and they can make an exception. Before doing that, consider what kind of securable sharing you will need between dev, test and prod (on different metastores). Some data science use cases will have different sharing needs than data engineering use cases.
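For cross-metastore sharing specifically, Databricks-to-Databricks Delta Sharing is the usual route. A rough sketch, run from a notebook on the prod metastore (share, recipient, catalog and table names are all made up, and the recipient's sharing identifier would come from the receiving metastore):

# Hedged sketch: share a prod table with a dev metastore via Delta Sharing.
# All object names are illustrative; '<dev-sharing-identifier>' must be replaced
# with the global sharing identifier of the receiving (dev) metastore.
spark.sql("CREATE SHARE IF NOT EXISTS prod_to_dev")
spark.sql("ALTER SHARE prod_to_dev ADD TABLE prod_bronze.raw.orders")
spark.sql("CREATE RECIPIENT IF NOT EXISTS dev_metastore USING ID '<dev-sharing-identifier>'")
spark.sql("GRANT SELECT ON SHARE prod_to_dev TO RECIPIENT dev_metastore")

# On the dev side, a metastore admin would then mount the share as a catalog:
# CREATE CATALOG shared_prod USING SHARE <provider_name>.prod_to_dev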
โ04-11-2025 06:29 AM
Glad to see it's not just me thinking about multiple metastores. Having a separate metastore per environment makes total sense: you get complete isolation between environments, and in dev, stg and prod you can reuse catalog names without having to use a prefix or something along those lines. Also, if you want to maintain physical storage separation between environments, you can do this at the metastore level. Did anyone implement this? Keen to hear their experience and what the limitations of such a setup could be, as right now I can't think of any.
โ06-14-2025 02:19 PM
We went with the single metastore after running some experiments with multiple and hitting issues, especially around Unity Catalog lineage and needing to copy data back from higher to lower environments to run various load-testing scenarios. We ended up storing environment configuration in Key Vault and reading it at runtime from notebooks, using the snippet in my earlier reply above.
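On back-copying data for load testing: within a single metastore this can be as simple as a Delta clone across catalogs. A rough sketch with made-up catalog, schema and table names:

# Assumed illustration: refresh a dev table from prod for load testing.
# DEEP CLONE copies data and metadata; all names here are made up.
spark.sql("""
    CREATE OR REPLACE TABLE dev_bronze.raw.orders
    DEEP CLONE prod_bronze.raw.orders
""")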