
What is the best way to use Unity Catalog with a medallion architecture on ADLS Gen2?

krishna007
New Contributor

Hi,

I am using a medallion architecture on Azure Data Lake Storage Gen2 with Azure Databricks. Currently, I am storing data in Parquet format (not Delta tables), and I am planning to implement Unity Catalog (UC).

As part of this setup, I understand that catalogs and schemas in UC require external locations. From an architecture and governance perspective, I am considering the following approaches:

Option 1: Single container for entire catalog

One container for the catalog
Separate folders inside the container for bronze, silver, and gold layers
Results in 4 external locations (1 for catalog + 3 for layers)
Data is logically separated (via folders), not physically (via containers)

Option 2: Three containers for layers, catalog within bronze

Separate containers for bronze, silver, and gold
Catalog stored inside the bronze container (in a separate folder)
Results in 4 external locations
Concern: mixes catalog storage with bronze layer, which may not align well with medallion principles

Option 3: Four separate containers

Separate containers for catalog, bronze, silver, and gold
Results in 4 external locations
Provides clear physical separation, but increases IAM and governance overhead

Question:
Which of these approaches is considered best practice from a scalability, governance, and Unity Catalog design perspective? Are there any recommended patterns for structuring storage and external locations when using UC with a medallion architecture?

2 REPLIES

Lu_Wang_ENB_DBX
Databricks Employee

Recommended high‑level pattern

  1. Design UC by domain, then medallion by schema
    • Use domain‑based catalogs (for example, sales, marketing, finance) or environment‑based (sales_dev, sales_prod).
    • Within each catalog, create schemas for medallion layers: sales.bronze, sales.silver, sales.gold (or similar).
  2. Use managed UC tables for bronze/silver/gold wherever possible
    • Databricks strongly recommends Unity Catalog managed tables for all lakehouse data (bronze through gold) and to reserve external tables only when data must stay in specific paths or be shared with non‑Databricks tools.
  3. Design external locations at the catalog boundary, not per medallion layer
    • Best practice is to create external locations at the highest common path prefix and to align them with catalog or schema boundaries (for example, one external location per catalog).
    • Explicitly define managed storage locations at catalog or schema, rather than relying on defaults.
    • External locations should be broad (often a whole container or a major sub‑path) and relatively few in number.
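
To make "highest common path prefix" concrete, here is a minimal Python sketch that derives a candidate external location URL from a set of existing table paths. The storage account `examplestorage`, the container `sales`, and the `lake/...` folders are made-up placeholders, and the helper name `common_location_prefix` is hypothetical; treat it as an illustration of the idea rather than anything Unity Catalog ships.

```python
from urllib.parse import urlsplit


def common_location_prefix(paths):
    """Return the deepest abfss:// prefix shared by all paths.

    Raises ValueError if the paths live in different containers or accounts,
    because a single prefix cannot cover them.
    """
    parts = [urlsplit(p) for p in paths]
    roots = {(p.scheme, p.netloc) for p in parts}
    if len(roots) != 1:
        raise ValueError("paths span more than one container/account")
    scheme, netloc = roots.pop()

    # Compare path segments position by position; stop at the first mismatch.
    segment_lists = [[s for s in p.path.split("/") if s] for p in parts]
    shared = []
    for segments in zip(*segment_lists):
        if len(set(segments)) != 1:
            break
        shared.append(segments[0])

    return f"{scheme}://{netloc}/" + "/".join(shared)


# Hypothetical paths: one container, one folder per medallion layer.
table_paths = [
    "abfss://sales@examplestorage.dfs.core.windows.net/lake/bronze/orders",
    "abfss://sales@examplestorage.dfs.core.windows.net/lake/silver/orders",
    "abfss://sales@examplestorage.dfs.core.windows.net/lake/gold/revenue",
]
print(common_location_prefix(table_paths))
```

With these inputs the shared prefix is the `lake` folder, so one external location at that path would cover all three layers instead of three narrower ones.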

 

How that maps to your three options

Assumption: you’re talking about customer‑managed ADLS Gen2, and you’ll configure UC catalogs/schemas to use that storage via external locations.

Option 1 – Single container per catalog, folders for bronze/silver/gold

  • What it looks like
    • Storage:
      abfss://<catalog-container>@<account>.dfs.core.windows.net/
      with subfolders like /bronze, /silver, /gold (or just let UC manage layout).
    • UC:
      • Catalog: sales
      • Schemas: sales.bronze, sales.silver, sales.gold
      • One external location pointing to the container (or catalog root path) and used as the managed storage location for the catalog.
  • When to use
    • Most small to mid‑size or single‑domain deployments.
    • When you don’t have extreme scale or very strong isolation requirements between bronze/silver/gold.

If you follow UC patterns (domain catalogs + medallion schemas + managed tables), Option 1 is generally the best starting point.
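
As a rough sketch of what Option 1 could look like in practice, the Python below only builds the Unity Catalog SQL statements as strings; in a Databricks notebook each one would be passed to `spark.sql(...)`. The catalog name `sales_prod`, container `sales-prod`, account `examplestorage`, and credential `lake_credential` are assumptions for illustration, and the storage credential is assumed to exist already.

```python
def option1_statements(catalog, container, account, credential):
    """Build Unity Catalog SQL for one container per catalog (Option 1)."""
    root = f"abfss://{container}@{account}.dfs.core.windows.net/"
    statements = [
        # One broad external location covering the whole container.
        f"CREATE EXTERNAL LOCATION IF NOT EXISTS {catalog}_root "
        f"URL '{root}' WITH (STORAGE CREDENTIAL {credential})",
        # The catalog uses that location as its managed storage.
        f"CREATE CATALOG IF NOT EXISTS {catalog} MANAGED LOCATION '{root}'",
    ]
    # Medallion layers become schemas, not separate storage locations.
    for layer in ("bronze", "silver", "gold"):
        statements.append(f"CREATE SCHEMA IF NOT EXISTS {catalog}.{layer}")
    return statements


for stmt in option1_statements(
    "sales_prod", "sales-prod", "examplestorage", "lake_credential"
):
    print(stmt)  # in a notebook: spark.sql(stmt)
```

The schemas carry no `MANAGED LOCATION` of their own here, so they inherit the catalog's; a schema-level override can be added later if one layer needs its own path.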

 

Option 2 – Three containers for layers, catalog stored inside bronze container

  • Issue: this mixes UC catalog managed storage with raw bronze landing storage in the same container.
  • UC best practices caution against putting managed storage and other external locations in the same storage account/container in storage‑intensive scenarios. They also treat external locations as broad governance boundaries, not as paths for direct ad‑hoc access.
  • Operationally it also tangles:
    • Raw ingestion lifecycle (often broader write/delete rights, external writers).
    • Catalog managed storage (where UC should be the primary governor).

I’d avoid Option 2; it creates a confusing mixing of concerns.

 

Option 3 – Separate containers for catalog, bronze, silver, gold

  • What it means
    • One container dedicated to UC catalog managed storage (for that domain/env).
    • Separate containers for raw/bronze, silver, gold.
  • Pros
    • Strongest physical isolation (per‑container RBAC, network rules, lifecycle policies).
    • Aligns with guidance not to put all storage‑intensive workloads into a single container.
  • Cons
    • More IAM and operational overhead (more containers, more external locations, more policies).
    • Not necessary if medallion is already clearly governed via schemas and UC permissions, which is the recommended pattern.
  • When to use
    • Very large or regulated environments where you want container‑level isolation:
      • Different backup/retention policies per layer.
      • Different storage accounts/subscriptions per security domain.
    • Even so, this is still not the recommended default. Prefer this instead:
      • One or more containers per domain/environment (catalog).
      • Use bronze/silver/gold as schemas, not as separate containers.

So Option 3 is viable for high‑isolation scenarios, but you can usually simplify it: separate containers per domain/env, not strictly per medallion layer.
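
For comparison with per‑container RBAC, here is a hedged sketch of governing the layers through UC schema permissions instead. It only generates `GRANT` statements as strings (to be run with `spark.sql` or in a SQL editor). The group names `ingestion_team`, `data_engineering`, and `bi_analysts` and the privilege sets are illustrative assumptions; adjust them to your own access model.

```python
def schema_grants(catalog, layer_access):
    """Build GRANT statements that scope each group to one medallion schema."""
    statements = []
    for layer, (group, privileges) in layer_access.items():
        # A group needs USE CATALOG on the parent catalog to reach its schemas.
        statements.append(f"GRANT USE CATALOG ON CATALOG {catalog} TO `{group}`")
        statements.append(
            f"GRANT {', '.join(privileges)} ON SCHEMA {catalog}.{layer} TO `{group}`"
        )
    return statements


# Hypothetical groups and privilege sets per layer.
layer_access = {
    "bronze": ("ingestion_team", ["USE SCHEMA", "CREATE TABLE", "MODIFY", "SELECT"]),
    "silver": ("data_engineering", ["USE SCHEMA", "CREATE TABLE", "MODIFY", "SELECT"]),
    "gold": ("bi_analysts", ["USE SCHEMA", "SELECT"]),
}

for stmt in schema_grants("sales", layer_access):
    print(stmt)
```

Privileges granted on a schema apply to the tables in it, so read‑only analysts on `sales.gold` never need direct access to the underlying container.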

 

Recommendation

Given your description and desire for good scalability and governance:

  1. Model
    • Pick domain‑ or env‑based catalogs (for example, ops, sales_prod, sales_dev).
    • In each, create bronze, silver, gold schemas.
  2. Storage + external locations (Azure)
    • For each domain/env, create one ADLS Gen2 container (Option‑1 style).
    • Create one storage credential (Access Connector) and one external location pointing to that container (or a top‑level path) and:
      • Use it as the managed storage location for the catalog (and optionally override at schema if needed).
    • For your existing Parquet medallion folders, either:
      • Define additional external locations at the relevant prefixes (if you must keep those exact paths) and register them as external tables/volumes; or
      • Migrate data into UC‑managed Delta tables under the catalog’s managed storage and gradually deprecate the old Parquet layout.
  3. When/if to introduce more containers (Option‑3)
    • Only if you hit:
      • Scale limits on a single storage account/container, or
      • Hard isolation requirements (for example, separate subscription/container for PII bronze vs non‑PII).
    • In that case, add more containers and corresponding external locations, still aligned to catalogs/domains, not to medallion schemas.
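
The two choices for the existing Parquet folders in step 2 can be sketched roughly as follows. The Python below only builds the SQL text; the legacy path, the account `examplestorage`, and the table names are hypothetical. Both statements assume the path is already covered by an external location that you have access to, and the second one should produce a managed Delta table, since Delta is the default table format on Databricks.

```python
def register_parquet_external(table, path):
    """SQL that registers existing Parquet files in place as an external table."""
    return f"CREATE TABLE IF NOT EXISTS {table} USING PARQUET LOCATION '{path}'"


def migrate_parquet_to_managed(table, path):
    """SQL that copies the Parquet data into a new managed table (CTAS)."""
    return f"CREATE TABLE IF NOT EXISTS {table} AS SELECT * FROM parquet.`{path}`"


# Hypothetical legacy folder holding bronze Parquet files.
legacy_path = "abfss://bronze@examplestorage.dfs.core.windows.net/orders"

print(register_parquet_external("sales.bronze.orders_ext", legacy_path))
print(migrate_parquet_to_managed("sales.bronze.orders", legacy_path))
```

The first form leaves the files where they are, which suits data still read by other tools; the second copies the data into the catalog's managed storage so the old layout can be retired once consumers have switched over.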

 

Summary

  • Option 1 (single container per catalog + medallion as folders/schemas) – recommended baseline and aligns well with UC architecture guidance when combined with medallion schemas and managed tables.
  • Option 2 (catalog inside bronze container) – not recommended; it mixes catalog managed storage with raw bronze, which is poor separation of concerns.
  • Option 3 (four containers) – good for strict isolation / very large scale, but usually overkill if medallion is already implemented at the schema level and governed via UC; treat it as an evolution from Option 1 when requirements justify it.

karthickrs
New Contributor III

Hi,

Option 2 should be avoided.
The real decision is between Option 1 (simpler) and Option 3 (best practice).

Why OPTION 2 is a NO GO:

This violates separation of concerns:

 

  •  Mixes governance layer (catalog storage) with data layer (bronze)
  •  Harder to manage IAM cleanly
  •  Confusing lineage and ownership
  •  Breaks the model of medallion architecture

OPTION 3 (BEST PRACTICE):

Separate Containers for:

  • Catalog (managed storage)
  • Bronze
  • Silver
  • Gold

Why this is the best approach:

1. Strong governance boundaries

Each layer can have separate IAM roles and separate access policies.

Example:

  • Bronze → ingestion team (write-heavy)
  • Silver → data engineering
  • Gold → BI / analytics users (read-heavy)

2. Clean Unity Catalog mapping

You can map external locations like:

abfss://bronze@storage.dfs.core.windows.net/
abfss://silver@storage.dfs.core.windows.net/
abfss://gold@storage.dfs.core.windows.net/
 

Then assign permissions on each external location as follows:

  • READ FILES on gold
  • WRITE FILES on bronze
  • etc.
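
A small sketch of how those per‑layer permissions might be expressed, assuming three external locations named `bronze_loc`, `silver_loc`, and `gold_loc` have been created over the containers above. In Unity Catalog the file‑level privileges on an external location are called `READ FILES` and `WRITE FILES`. The location names and group names here are hypothetical.

```python
# Hypothetical mapping: layer -> (group, privileges on that layer's location).
LAYER_GRANTS = {
    "bronze": ("ingestion_team", ["READ FILES", "WRITE FILES"]),
    "silver": ("data_engineering", ["READ FILES", "WRITE FILES"]),
    "gold": ("bi_analysts", ["READ FILES"]),
}


def external_location_grants(layer_grants):
    """Build one GRANT statement per layer's external location."""
    return [
        f"GRANT {', '.join(privileges)} ON EXTERNAL LOCATION {layer}_loc TO `{group}`"
        for layer, (group, privileges) in layer_grants.items()
    ]


for stmt in external_location_grants(LAYER_GRANTS):
    print(stmt)  # in a notebook: spark.sql(stmt)
```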

3. Better scalability & isolation

 

  • Storage growth is isolated per layer

Option 1 is good, but not ideal:

Pros

  • Faster to set up
  • Works fine for small/medium workloads

Cons

- Weaker governance

  • Hard to restrict access cleanly at folder level
  • Risk of accidental cross-layer access

- Less isolation

  • One misconfigured policy could impact everything

- Not ideal for multi-team environments

Karthick Ramachandran Seshadri
Data Architect | MS/MBA
Data + AI/ML/GenAI
17x Databricks Credentials