
Unity Catalog and environment setup

Kjetil
Contributor

We are implementing the Databricks medallion architecture (bronze, silver, gold). We have three environments/workspaces in Databricks: Dev, Test and Prod. Each catalog in Unity Catalog points to a specific location in the Azure Data Lake. It therefore seems that the (only?) solution will be to name the gold catalog in dev 'gold_dev', and so on. That in turn means we need to parameterize the environment name and use this parameter, which varies across environments, in the code for our data/ML pipelines.

Example of such a solution:

 

import os
from pyspark.sql import SparkSession

env = os.getenv("ENV", "dev")  # Default to 'dev' if not set

# Map the deployment environment to its environment-specific bronze catalog
catalog_map = {
    "dev": "bronze_dev",
    "test": "bronze_test",
    "prod": "bronze_prod"
}
bronze_catalog = catalog_map[env]

spark = SparkSession.builder.getOrCreate()

# Fully qualified Unity Catalog name: <catalog>.<schema>.<table>
df = spark.read.table(f"{bronze_catalog}.schema.table_name")

 

Question: Is this the preferred solution, or is it possible to do it in another way?

Note: I've noticed that some recommend using dev, test and prod as catalogs; however, we likely need more flexibility than simply using gold, silver and bronze schemas. That is why we lift these components to the catalog level, so that below this level in the hierarchy we can define specific schemas within the gold, silver, and bronze catalogs.
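For illustration, a minimal sketch of what this layer-per-environment layout could look like (the catalog, schema, table names and the storage path below are hypothetical placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical: one catalog per layer and environment, bound to its own ADLS location
spark.sql("""
    CREATE CATALOG IF NOT EXISTS gold_dev
    MANAGED LOCATION 'abfss://gold@mydatalakedev.dfs.core.windows.net/'
""")

# Domain-specific schemas live one level below the layer catalog
spark.sql("CREATE SCHEMA IF NOT EXISTS gold_dev.finance")
spark.sql("CREATE SCHEMA IF NOT EXISTS gold_dev.marketing")

# Tables are then addressed as <layer>_<env>.<domain>.<table>
df = spark.read.table("gold_dev.finance.revenue")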

2 REPLIES

Alberto_Umana
Databricks Employee

Hello @Kjetil,

Your proposed solution of parameterizing the environment name and using this parameter in your code for the data/ML pipelines is a valid approach. This method allows you to dynamically select the appropriate catalog based on the environment, ensuring that your code can run seamlessly across different environments (Dev, Test, Prod).

However, there is an alternative approach that you might consider. Instead of naming the catalogs as gold_dev, gold_test, and gold_prod, you could use the environment names directly as catalog names (e.g., dev, test, prod). This approach is recommended by some because it simplifies the naming convention and makes it clear which environment you are working in.

import os
from pyspark.sql import SparkSession

env = os.getenv("ENV", "dev")  # Default to 'dev' if not set
catalog_map = {
    "dev": "dev",
    "test": "test",
    "prod": "prod"
}
catalog = catalog_map[env]

spark = SparkSession.builder.getOrCreate()
df = spark.read.table(f"{catalog}.schema.table_name")
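Under this convention the medallion layer would sit at the schema level, so, continuing from the snippet above (with a placeholder table name), a read from the gold layer might look like:

# With environment-level catalogs, the layer becomes the schema: <env>.<layer>.<table>
df = spark.read.table(f"{catalog}.gold.table_name")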

Kjetil
Contributor

Thanks, yes, that is indeed an option. The issue there is that we lose some flexibility, in the sense that we can't define other sub-schemas under gold, silver and bronze, since names would then be of the form prod.gold.<table-name> instead of gold_dev.<schema-name>.<table-name>. I believe we need to be able to customize the schemas further than prod.gold.<table-name> allows for, but nothing is settled yet. Thank you for the reply.
