Data Engineering

Best practice on how to set up Medallion architecture pipelines inside DAB

jeremy98
Honored Contributor

Hi Community,

My team and I are working on refactoring our repository's folder structure. Currently, I have been placing pipelines related to the Medallion architecture inside a folder named notebook/. However, I believe they should be moved to src/, since we have also developed other pipelines that do not follow the Medallion architecture (Bronze, Silver, and Gold layers).

My main point is that some pipelines transform data from an old architecture into this new Medallion-based architecture, while others are developed exclusively within Databricks.

What do you think? Does this folder restructuring make sense?

 

src/
  - green_inference_pipeline/ (exists only in the new architecture)
  - water_inference_pipeline/ (exists only in the new architecture)
  - 00_bronze_to_01_silver_pt1.py (pipeline that ingests data from the old architecture into our new structure; how should we refactor it?)
  - 00_bronze_to_01_silver_pt2.py (pipeline that ingests data from the old architecture into our new structure; how should we refactor it?)
  - etc. for silver->gold and gold->portal

 

What would be intuitive names for pipelines that start from the Bronze layer but process only one table on a scheduled basis, with different schedules for different Bronze pipelines?

For example:

  • 00_bronze_fir_data_pipeline.py (runs daily at 1 AM)
  • 00_bronze_tiny_data_pipeline.py (runs daily at 2 AM)
  • 00_bronze_huge_data_pipeline.py (runs daily at 4 AM)

Do these naming conventions make sense, or would you suggest a more intuitive approach?

2 REPLIES

NandiniN
Databricks Employee

Checking.

mark_ott
Databricks Employee

Refactoring your folder structure and naming conventions for Medallion architecture pipelines is an essential step to keep code maintainable and intuitive. Based on your context, shifting these pipelines from notebook/ to src/ is a solid move, especially as your repository now contains more differentiated pipeline logic, including old-to-new architecture transformations and various processing routines for Databricks.

Folder Placement: notebook/ vs src/

  • Placing pipelines in src/ aligns with standard Python project structures where core logic lives under src/ and interactive/experimental code, like notebooks, goes in notebook/.

  • Pipelines for both Medallion (Bronze/Silver/Gold) and non-Medallion architectures are first-class production code, so keeping them together in src/ is logical.

  • For migration pipelines (old architecture → Medallion), it's best to house them in src/ alongside other ETL jobs, perhaps in a subfolder (e.g., migration/ or legacy_ingest/) if they grow in number.
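
Since you are working inside DAB: a typical Databricks Asset Bundle layout might look like the sketch below. The folder names are illustrative, but databricks.yml at the bundle root and a resources/ folder for job/pipeline YAML follow the standard bundle template.

my_bundle/
|-- databricks.yml              (bundle configuration)
|-- resources/                  (job/pipeline YAML definitions, incl. schedules)
|-- src/
|   |-- migration/              (old-to-new transformation scripts)
|   |-- green_inference_pipeline/
|   |-- water_inference_pipeline/
|-- notebook/                   (exploratory notebooks only)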

Naming and Structure Suggestions

Your current naming (e.g., 00_bronze_to_01_silver_pt1.py) is clear about the flow but could be streamlined. Here are options:

1. Migration/Transformation Specific Pipelines

If possible, encapsulate old-to-new transformation steps in a dedicated subfolder:

src/
|-- migration/
|   |-- bronze_to_silver_pt1.py
|   |-- bronze_to_silver_pt2.py
|-- green_inference_pipeline/
|-- water_inference_pipeline/

Alternatively, prefix migration scripts with migration_ or legacy_.

2. Bronze Layer Pipelines with Scheduling

Your naming scheme (00_bronze_fir_data_pipeline.py, scheduling notes in comments/docs) is clear for newcomers. However, including schedule information in the file name itself can be noisy unless scheduling is a primary distinguishing factor.

Suggestions:

  • Stick with pipeline-specific names, and document the schedule in a config file (YAML, JSON, etc.), orchestration docs, or pipeline code comments (see the sketch after this list).

  • If schedules are a strong part of the identity, consider a naming template like:

    bronze_fir_daily_pipeline.py
    bronze_tiny_daily_pipeline.py
    bronze_huge_daily_pipeline.py

    But avoid having time-of-day in the filename (e.g., _1am_), as that can quickly become brittle as schedules change.
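
As a sketch of the config-file approach: the file name schedules.yml and the helper below are hypothetical, and PyYAML is assumed to be available. (If the jobs are defined in your Databricks Asset Bundle, the cron schedule can simply live in each job's resource YAML instead.)

# Minimal sketch, assuming a hypothetical schedules.yml next to the pipelines:
#   bronze_fir_data_pipeline:  "0 0 1 * * ?"   # Quartz cron: daily at 1 AM
#   bronze_tiny_data_pipeline: "0 0 2 * * ?"   # daily at 2 AM
#   bronze_huge_data_pipeline: "0 0 4 * * ?"   # daily at 4 AM
import yaml  # PyYAML, assumed available

def load_schedule(pipeline_name: str, path: str = "schedules.yml") -> str:
    """Look up a pipeline's cron expression so timing never leaks into filenames."""
    with open(path) as f:
        return yaml.safe_load(f)[pipeline_name]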

3. More Intuitive Naming for Pipelines

  • Use names reflecting the data processed (e.g., bronze_orders_ingest.py), source, or domain instead of times.

  • Place all Bronze pipelines in a folder:

    src/
    |-- bronze/
    |   |-- fir_data_pipeline.py
    |   |-- tiny_data_pipeline.py
    |   |-- huge_data_pipeline.py

    Then document schedules elsewhere.
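
If those per-table Bronze scripts end up nearly identical, a shared helper keeps them thin. A minimal sketch, assuming PySpark and Parquet sources; the module placement (e.g., src/bronze/common.py), paths, and table names here are hypothetical:

from pyspark.sql import SparkSession

def ingest_bronze_table(source_path: str, target_table: str, fmt: str = "parquet") -> None:
    """Append one source dataset into its Bronze table."""
    spark = SparkSession.builder.getOrCreate()
    df = spark.read.format(fmt).load(source_path)
    df.write.mode("append").saveAsTable(target_table)

Each script such as fir_data_pipeline.py then reduces to a single call, e.g. ingest_bronze_table("/mnt/raw/fir", "bronze.fir_data").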

Refactoring Migration Pipelines

  • Split transformation steps into composable functions or classes for clarity and reuse.

  • If steps are chained, consider a driver script or an orchestrator (Airflow, Databricks Jobs) that sequences them; see the sketch after this list.

  • Use descriptive function names within modules:

    def extract_legacy_bronze(): ...
    def transform_to_silver(): ...

    This makes the intent clear to future readers.
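
Putting it together, a minimal driver sketch (a hypothetical src/migration/run_migration.py; the step bodies are placeholders for your existing transformation logic):

from pyspark.sql import DataFrame, SparkSession

def extract_legacy_bronze(spark: SparkSession) -> DataFrame:
    ...  # read raw data from the old architecture

def transform_to_silver(bronze: DataFrame) -> DataFrame:
    ...  # cleaning/conforming rules for the Silver layer

def run_migration() -> None:
    spark = SparkSession.builder.getOrCreate()
    silver = transform_to_silver(extract_legacy_bronze(spark))
    # hand silver off to the silver->gold step here, or let a Databricks
    # Job with sequential tasks orchestrate each stage

if __name__ == "__main__":
    run_migration()

Alternatively, keep each step as its own task in a Databricks Job and let the job definition (in your bundle's resources/) sequence them.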

Summary Table

Folder      | Purpose                                              | Example Naming
src/        | Main ETL pipelines (Medallion & other architectures) | bronze_orders_ingest.py, green_inference/
migration/  | Old-to-new transformation scripts                    | bronze_to_silver_pt1.py
bronze/     | Bronze layer table-specific pipelines                | fir_data_pipeline.py
 
 

Recommendations

  • Move all ETL pipelines to src/ regardless of architecture.

  • Organize migration scripts in a clear subfolder or with a consistent prefix.

  • Name pipelines for their data/topic, not for their schedule or execution time.

  • Use documentation and metadata/configs to record scheduling, not filenames.

This restructuring and naming strategy will help your repo scale as your team and pipeline complexity grow.