Data Engineering

Best practice on how to set up Medallion architecture pipelines inside DAB

jeremy98
Honored Contributor

Hi Community,

My team and I are working on refactoring our repository's folder structure. Currently, I have been placing pipelines related to the Medallion architecture inside a folder named notebook/. However, I believe they should be moved to src/, since we have also developed other pipelines that do not follow the Medallion architecture (Bronze, Silver, and Gold layers).

My main point is that some pipelines transform data from an old architecture into this new Medallion-based architecture, while others are developed exclusively within Databricks.

What do you think? Does this folder restructuring make sense?

 

src/
  - green_inference_pipeline/ (exists only in the new architecture)
  - water_inference_pipeline/ (exists only in the new architecture)
  - 00_bronze_to_01_silver_pt1.py (pipeline that ingests data from the old architecture into our new structure; how should we refactor it?)
  - 00_bronze_to_01_silver_pt2.py (pipeline that ingests data from the old architecture into our new structure; how should we refactor it?)
  - etc. for silver->gold and gold->portal

 

What would be intuitive names for pipelines that start from the Bronze layer but process only one table on a scheduled basis, with different schedules for different Bronze pipelines?

For example:

  • 00_bronze_fir_data_pipeline.py (runs daily at 1 AM)
  • 00_bronze_tiny_data_pipeline.py (runs daily at 2 AM)
  • 00_bronze_huge_data_pipeline.py (runs daily at 4 AM)

Do these naming conventions make sense, or would you suggest a more intuitive approach?

2 REPLIES

NandiniN
Databricks Employee

Checking.

mark_ott
Databricks Employee

Refactoring your folder structure and naming conventions for Medallion architecture pipelines is an essential step to keep code maintainable and intuitive. Based on your context, shifting these pipelines from notebook/ to src/ is a solid move, especially as your repository now contains more differentiated pipeline logic, including old-to-new architecture transformations and various processing routines for Databricks.

Folder Placement: notebook/ vs src/

  • Placing pipelines in src/ aligns with standard Python project structures where core logic lives under src/ and interactive/experimental code, like notebooks, goes in notebook/.

  • Pipelines for both Medallion (Bronze/Silver/Gold) and non-Medallion architectures are first-class production code, so keeping them together in src/ is logical.

  • For migration pipelines (old architecture → Medallion), it's best to house them in src/ alongside other ETL jobs, perhaps in a subfolder (e.g., migration/ or legacy_ingest/) if they grow in number.
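
Since you are working inside DAB: a typical Databricks Asset Bundle layout might look like the sketch below. The folder names are illustrative, but databricks.yml at the bundle root and a resources/ folder for job/pipeline YAML follow the standard bundle template.

my_bundle/
|-- databricks.yml              (bundle configuration)
|-- resources/                  (job/pipeline YAML definitions, incl. schedules)
|-- src/
|   |-- migration/              (old-to-new transformation scripts)
|   |-- green_inference_pipeline/
|   |-- water_inference_pipeline/
|-- notebook/                   (exploratory notebooks only)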

Naming and Structure Suggestions

Your current naming (e.g., 00_bronze_to_01_silver_pt1.py) is clear about the flow but could be streamlined. Here are options:

1. Migration/Transformation Specific Pipelines

If possible, encapsulate old-to-new transformation steps in a dedicated subfolder:

src/
|-- migration/
|   |-- bronze_to_silver_pt1.py
|   |-- bronze_to_silver_pt2.py
|-- green_inference_pipeline/
|-- water_inference_pipeline/

Alternatively, prefix migration scripts with migration_ or legacy_.

2. Bronze Layer Pipelines with Scheduling

Your naming scheme (00_bronze_fir_data_pipeline.py, scheduling notes in comments/docs) is clear for newcomers. However, including schedule information in the file name itself can be noisy unless scheduling is a primary distinguishing factor.

Suggestions:

  • Stick with pipeline-specific names, and document the schedule in a config file (YAML, JSON, etc.), orchestration docs, or pipeline code comments (see the sketch after this list).

  • If schedules are a strong part of the identity, consider a naming template like:

    bronze_fir_daily_pipeline.py
    bronze_tiny_daily_pipeline.py
    bronze_huge_daily_pipeline.py

    But avoid having time-of-day in the filename (e.g., _1am_), as that can quickly become brittle as schedules change.
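
As a sketch of the config-file approach: the file name schedules.yml and the helper below are hypothetical, and PyYAML is assumed to be available. (If the jobs are defined in your Databricks Asset Bundle, the cron schedule can simply live in each job's resource YAML instead.)

# Minimal sketch, assuming a hypothetical schedules.yml next to the pipelines:
#   bronze_fir_data_pipeline:  "0 0 1 * * ?"   # Quartz cron: daily at 1 AM
#   bronze_tiny_data_pipeline: "0 0 2 * * ?"   # daily at 2 AM
#   bronze_huge_data_pipeline: "0 0 4 * * ?"   # daily at 4 AM
import yaml  # PyYAML, assumed available

def load_schedule(pipeline_name: str, path: str = "schedules.yml") -> str:
    """Look up a pipeline's cron expression so timing never leaks into filenames."""
    with open(path) as f:
        return yaml.safe_load(f)[pipeline_name]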

3. More Intuitive Naming for Pipelines

  • Use names reflecting the data processed (e.g., bronze_orders_ingest.py), source, or domain instead of times.

  • Place all Bronze pipelines in a folder:

    src/
    |-- bronze/
    |   |-- fir_data_pipeline.py
    |   |-- tiny_data_pipeline.py
    |   |-- huge_data_pipeline.py

    Then document schedules elsewhere.
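
If those per-table Bronze scripts end up nearly identical, a shared helper keeps them thin. A minimal sketch, assuming PySpark and Parquet sources; the module placement (e.g., src/bronze/common.py), paths, and table names here are hypothetical:

from pyspark.sql import SparkSession

def ingest_bronze_table(source_path: str, target_table: str, fmt: str = "parquet") -> None:
    """Append one source dataset into its Bronze table."""
    spark = SparkSession.builder.getOrCreate()
    df = spark.read.format(fmt).load(source_path)
    df.write.mode("append").saveAsTable(target_table)

Each script such as fir_data_pipeline.py then reduces to a single call, e.g. ingest_bronze_table("/mnt/raw/fir", "bronze.fir_data").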

Refactoring Migration Pipelines

  • Split transformation steps into composable functions or classes for clarity and reuse.

  • If steps are chained, consider a driver script or an orchestrator (Airflow, Databricks Jobs) that sequences them; see the sketch after this list.

  • Use descriptive function names within modules:

    def extract_legacy_bronze(): ...
    def transform_to_silver(): ...

    This makes the intent clear to future readers.
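
Putting it together, a minimal driver sketch (a hypothetical src/migration/run_migration.py; the step bodies are placeholders for your existing transformation logic):

from pyspark.sql import DataFrame, SparkSession

def extract_legacy_bronze(spark: SparkSession) -> DataFrame:
    ...  # read raw data from the old architecture

def transform_to_silver(bronze: DataFrame) -> DataFrame:
    ...  # cleaning/conforming rules for the Silver layer

def run_migration() -> None:
    spark = SparkSession.builder.getOrCreate()
    silver = transform_to_silver(extract_legacy_bronze(spark))
    # hand silver off to the silver->gold step here, or let a Databricks
    # Job with sequential tasks orchestrate each stage

if __name__ == "__main__":
    run_migration()

Alternatively, keep each step as its own task in a Databricks Job and let the job definition (in your bundle's resources/) sequence them.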

Summary Table

Folder      | Purpose                                              | Example Naming
src/        | Main ETL pipelines (Medallion & other architectures) | bronze_orders_ingest.py, green_inference/
migration/  | Old-to-new transformation scripts                    | bronze_to_silver_pt1.py
bronze/     | Bronze layer table-specific pipelines                | fir_data_pipeline.py
 
 

Recommendations

  • Move all ETL pipelines to src/ regardless of architecture.

  • Organize migration scripts in a clear subfolder or with a consistent prefix.

  • Name pipelines for their data/topic, not for their schedule or execution time.

  • Use documentation and metadata/configs to record scheduling, not filenames.

This restructuring and naming strategy will help your repo scale as your team and pipeline complexity grow.