Best practice on how to set up a medallion architecture pipelines inside DAB

jeremy98 — Wed, 12 Feb 2025 13:35:34 GMT

Hi Community,

My team and I are refactoring the folder structure of our repository. Currently, the pipelines related to the Medallion architecture live in a folder named notebook/. However, I believe they should be moved to src/, since we have also developed other pipelines that do not follow the Medallion architecture (Bronze, Silver, and Gold layers).

My main point is that some pipelines transform data from an old architecture into this new Medallion-based architecture, while others are developed exclusively within Databricks.

What do you think? Does this folder restructuring make sense?

src/
- green_inference_pipeline/ (exists only in the new architecture)
- water_inference_pipeline/ (exists only in the new architecture)
- 00_bronze_to_01_silver_pt1.py (this pipeline ingests data from the old architecture into our new structure; how should we refactor it?)
- 00_bronze_to_01_silver_pt2.py (this pipeline ingests data from the old architecture into our new structure; how should we refactor it?)
- etc., with silver->gold and gold->portal

What would be intuitive names for pipelines that start from the Bronze layer but process only one table on a scheduled basis, with different schedules for different Bronze pipelines?

For example:

00_bronze_fir_data_pipeline.py (runs daily at 1 AM)
00_bronze_tiny_data_pipeline.py (runs daily at 2 AM)
00_bronze_huge_data_pipeline.py (runs daily at 4 AM)

Do these naming conventions make sense, or would you suggest a more intuitive approach?

Re: Best practice on how to set up a medallion architecture pipelines inside DAB

NandiniN — Thu, 01 May 2025 07:00:01 GMT

Checking.

Re: Best practice on how to set up a medallion architecture pipelines inside DAB

mark_ott — Fri, 31 Oct 2025 15:17:08 GMT

Refactoring your folder structure and naming conventions for Medallion architecture pipelines is an essential step to keep code maintainable and intuitive. Based on your context, shifting these pipelines from notebook/ to src/ is a solid move, especially as your repository now contains more differentiated pipeline logic, including old-to-new architecture transformations and various processing routines for Databricks.

Folder Placement: notebook/ vs src/

Placing pipelines in src/ aligns with standard Python project structures where core logic lives under src/ and interactive/experimental code, like notebooks, goes in notebook/.
Pipelines for both Medallion (Bronze/Silver/Gold) and non-Medallion architectures are first-class production code, so keeping them together in src/ is logical.
For migration pipelines (old architecture → Medallion), it's best to house them in src/ alongside other ETL jobs, perhaps in a subfolder (e.g., migration/ or legacy_ingest/) if they grow in number.

Naming and Structure Suggestions

Your current naming (e.g., 00_bronze_to_01_silver_pt1.py) is clear about the flow but could be streamlined. Here are options:

1. Migration/Transformation Specific Pipelines

If possible, encapsulate old-to-new transformation steps in a dedicated subfolder:

src/
 |-- migration/
      |-- bronze_to_silver_pt1.py
      |-- bronze_to_silver_pt2.py
 |-- green_inference_pipeline/
 |-- water_inference_pipeline/

Alternatively, prefix migration scripts with migration_ or legacy_.

2. Bronze Layer Pipelines with Scheduling

Your naming scheme (00_bronze_fir_data_pipeline.py, scheduling notes in comments/docs) is clear for newcomers. However, including schedule information in the file name itself can be noisy unless scheduling is a primary distinguishing factor.

Suggestions:

Stick with pipeline-specific names, and document the schedule in a config file (YAML, JSON, etc.), orchestration docs, or in pipeline code comments.
If schedules are a strong part of the identity, consider a naming template like:

bronze_fir_daily_pipeline.py
bronze_tiny_daily_pipeline.py
bronze_huge_daily_pipeline.py

But avoid having time-of-day in the filename (e.g., _1am_), as that can quickly become brittle as schedules change.
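If the team wants to enforce this convention, a small check can flag filenames that embed a time of day. The following is only a sketch: the function name has_time_of_day and the pattern it uses (a number followed by am or pm, set off by underscores) are assumptions for illustration, not an existing tool.

```python
import re

# Hypothetical pattern: 1-2 digits followed by am/pm, delimited by "_" or ".".
_TIME_OF_DAY = re.compile(r"(?:^|_)\d{1,2}(?:am|pm)(?:_|\.|$)", re.IGNORECASE)


def has_time_of_day(filename: str) -> bool:
    """Return True if the filename embeds a token such as _1am_ or _4pm."""
    return _TIME_OF_DAY.search(filename) is not None


if __name__ == "__main__":
    for name in ["bronze_fir_1am_pipeline.py", "bronze_fir_daily_pipeline.py"]:
        print(name, has_time_of_day(name))
```

A check like this could run in CI or a pre-commit hook so that schedule changes never force a file rename.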

3. More Intuitive Naming for Pipelines

Use names reflecting the data processed (e.g., bronze_orders_ingest.py), source, or domain instead of times.
Place all Bronze pipelines in a folder:

src/
 |-- bronze/
      |-- fir_data_pipeline.py
      |-- tiny_data_pipeline.py
      |-- huge_data_pipeline.py

Then document schedules elsewhere.
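One way to keep schedules out of filenames is a small registry that maps each pipeline to its schedule. The sketch below is illustrative only: the PipelineSchedule class, the SCHEDULES mapping, the file paths, and the UTC timezone are assumptions, and the cron strings use the Quartz syntax that Databricks job schedules accept. In a bundle, the same information would more typically live in each job's schedule settings in the bundle's YAML resources than in Python.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineSchedule:
    """Hypothetical record tying a pipeline file to its schedule."""

    module: str       # path relative to the repository root (assumed layout)
    quartz_cron: str  # Quartz fields: sec min hour day-of-month month day-of-week
    timezone: str     # assumed timezone; adjust to your workspace


# Run times taken from the examples in the question (1 AM, 2 AM, 4 AM daily).
SCHEDULES = {
    "fir_data": PipelineSchedule("src/bronze/fir_data_pipeline.py", "0 0 1 * * ?", "UTC"),
    "tiny_data": PipelineSchedule("src/bronze/tiny_data_pipeline.py", "0 0 2 * * ?", "UTC"),
    "huge_data": PipelineSchedule("src/bronze/huge_data_pipeline.py", "0 0 4 * * ?", "UTC"),
}


def daily_hour(schedule: PipelineSchedule) -> int:
    """Return the hour field of a simple daily Quartz expression."""
    return int(schedule.quartz_cron.split()[2])


if __name__ == "__main__":
    for name, schedule in SCHEDULES.items():
        print(f"{name}: runs daily at hour {daily_hour(schedule)} ({schedule.timezone})")
```

With this layout, changing a run time touches one line of configuration and leaves filenames and imports alone.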

Refactoring Migration Pipelines

Split transformation steps into composable functions or classes for clarity and reuse.
If steps are chained, consider a driver script or orchestration (Airflow, Databricks Jobs) that sequences them.
Use descriptive function names within modules:

def extract_legacy_bronze(): ...
def transform_to_silver(): ...

This makes the intent clear to future readers.
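To make the idea of composable steps plus a driver more concrete, here is a minimal sketch. It uses plain Python lists of dicts as a stand-in for Spark DataFrames so it runs anywhere; the field names (id, name), the filtering rule, and the run_steps and run_migration helpers are all hypothetical and would need to be replaced by your actual logic.

```python
from typing import Callable, Iterable

Row = dict  # stand-in for a record; a real pipeline would use Spark DataFrames
Step = Callable[[list], list]


def extract_legacy_bronze(rows: Iterable[Row]) -> list:
    """Keep only legacy records that carry an id (hypothetical validity rule)."""
    return [row for row in rows if row.get("id") is not None]


def transform_to_silver(rows: Iterable[Row]) -> list:
    """Normalise the hypothetical 'name' field: trim whitespace, lowercase."""
    return [{**row, "name": str(row.get("name", "")).strip().lower()} for row in rows]


def run_steps(rows: Iterable[Row], steps: Iterable[Step]) -> list:
    """Driver: apply each step in order, feeding the output into the next."""
    result = list(rows)
    for step in steps:
        result = step(result)
    return result


def run_migration(rows: Iterable[Row]) -> list:
    """Chain the bronze-to-silver steps in one place."""
    return run_steps(rows, [extract_legacy_bronze, transform_to_silver])


if __name__ == "__main__":
    legacy = [{"id": 1, "name": "  Alice "}, {"id": None, "name": "ghost"}]
    print(run_migration(legacy))
```

Because each step takes rows in and returns rows out, the steps can be unit-tested in isolation, and the pt1/pt2 split can become an ordered list of steps instead of separate numbered files.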

Summary Table

Folder	Purpose	Example Naming
src/	Main ETL pipelines (Medallion & other architectures)	bronze_orders_ingest.py, green_inference/
migration/	Old-to-new transformation scripts	bronze_to_silver_pt1.py
bronze/	Bronze layer table-specific pipelines	fir_data_pipeline.py

Recommendations

Move all ETL pipelines to src/ regardless of architecture.
Organize migration scripts in a clear subfolder or with consistent prefix.
Name pipelines for their data/topic, not for their schedule or execution time.
Use documentation and metadata/configs to record scheduling, not filenames.

This restructuring and naming strategy will help your repo scale as your team and pipeline complexity grow.
