Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Schema updating with CI/CD development in SQL

turagittech
New Contributor III

Hi all,

I am working out how to build tables in a development workspace catalog and then easily migrate the code to a production catalog without manually altering the schema name. For those unaware, you can't have the same catalog names in development as in production: catalog names are not scoped to a workspace, so the development and production catalogs have to be named differently.

Maybe I am misunderstanding how this is meant to be approached, so I'm happy to be guided; I haven't yet found documentation with that guidance. I might look for a cluster or serverless warehouse tag to tell which environment I'm running against.
I could go to an external tool like Liquibase, but I'd like to avoid that if I can. I'm happy to manually run scripts for deployments, but I don't want to have to go through and update code every time I test or build new deployments.

This feels like a glaring deficiency in the documentation and/or tooling

2 REPLIES 2

BigRoux
Databricks Employee

This is a common challenge when working with Databricks Unity Catalog across different environments. The issue stems from the fact that catalog names must be unique across all workspaces sharing the same metastore, requiring different naming for development and production environments.

Environment-Based Catalog Naming Approaches

There are several recommended approaches to handle this situation:

1. Use Environment Parameters in Notebooks

You can pass environment parameters to your notebooks to dynamically reference the appropriate catalog:

```python
# At the beginning of your notebook: declare the widget with a default value
# so dbutils.widgets.get() does not fail when no parameter is passed
dbutils.widgets.text("environment", "dev")
env = dbutils.widgets.get("environment")
catalog_name = f"{env}_my_catalog"

# Then use the dynamic catalog name in your queries
spark.sql(f"USE CATALOG {catalog_name}")
```

This approach allows the same code to run in different environments by simply changing the parameter value. For example, the DEV workspace would be passed "dev", QA would use "qa", and PROD would use "prod", so that notebooks in each workspace read from and write to their environment-specific catalog.
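Building on that pattern, the DDL itself can reference the parameterized catalog so the same script runs unchanged in every environment. A minimal sketch, reusing catalog_name from the snippet above; the sales schema and orders table are hypothetical names used purely for illustration:

```python
# Hypothetical schema/table names; only catalog_name differs between environments
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {catalog_name}.sales")
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS {catalog_name}.sales.orders (
        order_id BIGINT,
        order_ts TIMESTAMP,
        amount   DECIMAL(12, 2)
    )
""")

# Reads and writes also use the fully qualified, parameterized name
orders_df = spark.table(f"{catalog_name}.sales.orders")
```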

2. Environment-Prefixed Catalog Structure

A common naming convention is to prefix your catalogs with the environment name:
- dev_finance
- stg_finance
- prod_finance

This makes it clear which environment each catalog belongs to and maintains consistency across environments.
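If you combine the prefix convention with the widget pattern above, catalog resolution can live in one small helper. A minimal sketch, assuming the dev/stg/prod prefixes shown in the list; finance is just the example base name:

```python
# Assumes the dev/stg/prod prefix convention from the example above
VALID_ENVS = {"dev", "stg", "prod"}

def catalog_for(env: str, base_name: str = "finance") -> str:
    """Return the environment-prefixed catalog name, e.g. 'dev_finance'."""
    if env not in VALID_ENVS:
        raise ValueError(f"Unknown environment: {env!r}")
    return f"{env}_{base_name}"

spark.sql(f"USE CATALOG {catalog_for(env)}")
```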

3. Use Terraform for Deployment

You can use Terraform to manage deployments, with per-environment variable files in your project folders driving which parameters are used, including catalog and schema names. This provides a more standardized approach to managing environment-specific configuration.

Migration Tools

If you need more robust schema migration capabilities:

1. Liquibase: Databricks officially supports Liquibase for database schema migrations.

2. Alembic: Can be used for schema evolution in Databricks, particularly useful if you're already using Python and SQLAlchemy.

3. DBT: Some teams have migrated to DBT for handling schema migrations, which can make it easier to manage schema changes in a Hive metastore environment.

4. Custom Migration Scripts: You could develop a custom solution where SQL scripts are named to ensure proper execution order, with tracking of which migrations have been executed in each environment (see the sketch after this list).
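For the custom-script option, here is a hedged sketch of what such a runner might look like. It assumes SQL files named like 001_create_orders.sql in a migrations/ folder, a {catalog} placeholder inside each script, one statement per file, and a hypothetical admin.schema_migrations tracking table; none of this is a Databricks-provided tool.

```python
import os

MIGRATIONS_DIR = "migrations"  # hypothetical folder: 001_create_orders.sql, 002_add_column.sql, ...
HISTORY_TABLE = f"{catalog_name}.admin.schema_migrations"  # catalog_name from the widget pattern above

spark.sql(f"CREATE SCHEMA IF NOT EXISTS {catalog_name}.admin")
spark.sql(f"CREATE TABLE IF NOT EXISTS {HISTORY_TABLE} (version STRING, applied_ts TIMESTAMP)")

applied = {row.version for row in spark.table(HISTORY_TABLE).select("version").collect()}

for filename in sorted(os.listdir(MIGRATIONS_DIR)):  # numeric prefix controls execution order
    if not filename.endswith(".sql"):
        continue
    version = filename.split("_")[0]
    if version in applied:
        continue
    with open(os.path.join(MIGRATIONS_DIR, filename)) as f:
        # Each script uses a {catalog} placeholder and contains a single statement
        statement = f.read().format(catalog=catalog_name)
    spark.sql(statement)
    spark.sql(f"INSERT INTO {HISTORY_TABLE} VALUES ('{version}', current_timestamp())")
    print(f"Applied {filename}")
```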

Best Practice Recommendation

The most straightforward approach is to:

1. Use environment parameters in your code to dynamically reference the appropriate catalog
2. Follow a consistent naming convention (like environment prefixing)
3. Implement a deployment process that handles the parameter substitution automatically

This way, you don't need to manually alter schema names when moving between environments, and your code remains environment-agnostic.
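As an illustration of point 3, the parameter substitution can be driven from an orchestration notebook or a job rather than by hand. A minimal sketch using dbutils.notebook.run; the notebook path and the hard-coded environment value are hypothetical and would normally come from your job or CI/CD configuration:

```python
# Hypothetical orchestration step: in practice target_env would come from a
# job parameter or your CI/CD pipeline, not a hard-coded value
target_env = "prod"

dbutils.notebook.run(
    "/Repos/project/notebooks/build_tables",  # hypothetical notebook path
    3600,                                     # timeout in seconds
    {"environment": target_env},              # read via dbutils.widgets.get("environment")
)
```

The same effect can be achieved in a Databricks job by setting the notebook task's base parameters per environment, so each workspace's job passes its own value for the environment widget.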


turagittech
New Contributor III

Thanks for this. Now to work out how much effort I want to put into Alembic or dbt. I don't see any reason to go with Liquibase.

Still an area for improvement, actually a lot of improvement. Being able to build manageable, governed data warehouse schemas is as important to governance as any other part of Unity Catalog.
