
Unity Catalog and environment setup

Kjetil
Contributor

We are implementing the Databricks medallion architecture (bronze, silver, gold). We have three environments/workspaces in Databricks: Dev, Test and Prod. Each catalog in Unity Catalog points to a specific location in the Azure Data Lake. It therefore seems that the (only?) solution will be to name the gold catalog in dev 'gold_dev', and so on. That in turn means we need to parameterize the environment name and use this parameter, which varies across environments, in the code for our data/ML pipelines.

Example of such a solution:

 

import os
from pyspark.sql import SparkSession

env = os.getenv("ENV", "dev")  # Default to 'dev' if not set

# Map the deployment environment to its environment-specific bronze catalog
catalog_map = {
    "dev": "bronze_dev",
    "test": "bronze_test",
    "prod": "bronze_prod"
}
bronze_catalog = catalog_map[env]

spark = SparkSession.builder.getOrCreate()

# Fully qualified Unity Catalog name: <catalog>.<schema>.<table>
df = spark.read.table(f"{bronze_catalog}.schema.table_name")

 

Question: Is this the preferred solution, or is it possible to do it in another way?

Note: I've noticed that some recommend using dev, test and prod as catalogs; however, we likely need more flexibility than simply using gold, silver and bronze schemas. That is why we lift these components to the catalog level, so that below this level in the hierarchy we can define specific schemas within the gold, silver, and bronze catalogs.
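For illustration, a minimal sketch of what this layer-per-environment layout could look like (the catalog, schema, table names and the storage path below are hypothetical placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical: one catalog per layer and environment, bound to its own ADLS location
spark.sql("""
    CREATE CATALOG IF NOT EXISTS gold_dev
    MANAGED LOCATION 'abfss://gold@mydatalakedev.dfs.core.windows.net/'
""")

# Domain-specific schemas live one level below the layer catalog
spark.sql("CREATE SCHEMA IF NOT EXISTS gold_dev.finance")
spark.sql("CREATE SCHEMA IF NOT EXISTS gold_dev.marketing")

# Tables are then addressed as <layer>_<env>.<domain>.<table>
df = spark.read.table("gold_dev.finance.revenue")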

2 REPLIES

Alberto_Umana
Databricks Employee

Hello @Kjetil,

Your proposed solution of parameterizing the environment name and using this parameter in your code for the data/ML pipelines is a valid approach. This method allows you to dynamically select the appropriate catalog based on the environment, ensuring that your code can run seamlessly across different environments (Dev, Test, Prod).

However, there is an alternative approach that you might consider. Instead of naming the catalogs as gold_dev, gold_test, and gold_prod, you could use the environment names directly as catalog names (e.g., dev, test, prod). This approach is recommended by some because it simplifies the naming convention and makes it clear which environment you are working in.

import os
from pyspark.sql import SparkSession

env = os.getenv("ENV", "dev")  # Default to 'dev' if not set
catalog_map = {
    "dev": "dev",
    "test": "test",
    "prod": "prod"
}
catalog = catalog_map[env]

spark = SparkSession.builder.getOrCreate()
df = spark.read.table(f"{catalog}.schema.table_name")
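Under this convention the medallion layer would sit at the schema level, so, continuing from the snippet above (with a placeholder table name), a read from the gold layer might look like:

# With environment-level catalogs, the layer becomes the schema: <env>.<layer>.<table>
df = spark.read.table(f"{catalog}.gold.table_name")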

Kjetil
Contributor

Thanks, yes, that is indeed an option. The issue there is that we lose some flexibility, in the sense that we can't define other sub-schemas under gold, silver and bronze, since names would then be of the form prod.gold.<table-name> instead of gold_dev.<schema-name>.<table-name>. I believe we need to be able to customize the schemas further than prod.gold.<table-name> allows for, but nothing is settled yet. Thank you for the reply.
