Hi Community,
My team and I are refactoring our repository's folder structure. So far, I have been placing the pipelines related to the Medallion architecture (Bronze, Silver, and Gold layers) inside a folder named notebook/. However, I believe they should be moved to src/, since we have also developed other pipelines that do not follow the Medallion architecture.
My main point is that some pipelines transform data from an old architecture into this new Medallion-based architecture, while others are developed exclusively within Databricks.
What do you think? Does this folder restructuring make sense?
src/
- green_inference_pipeline/ (exists only in the new architecture)
- water_inference_pipeline/ (exists only in the new architecture)
- 00_bronze_to_01_silver_pt1.py (this pipeline ingests data from the old architecture into our new structure; how should we refactor it?)
- 00_bronze_to_01_silver_pt2.py (this pipeline ingests data from the old architecture into our new structure; how should we refactor it?)
- etc., with the silver->gold and gold->portal pipelines
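To make the question more concrete, here is a minimal sketch of how the entries above could be split into two subfolders based on their names alone. The subfolder names medallion and standalone, the classify function, and the assumed file-name pattern are only illustrations, not something we have in place.

```python
import re

# Assumed naming pattern for layer-to-layer pipelines:
# "<NN>_<layer>_to_<NN>_<layer>" with an optional "_ptN" suffix, ending in ".py".
LAYER_STEP = re.compile(r"^\d{2}_[a-z]+_to_\d{2}_[a-z]+(?:_pt\d+)?\.py$")


def classify(name: str) -> str:
    """Return a hypothetical subfolder for an entry that currently sits directly under src/."""
    if LAYER_STEP.match(name):
        return "medallion"
    return "standalone"


entries = [
    "green_inference_pipeline",
    "water_inference_pipeline",
    "00_bronze_to_01_silver_pt1.py",
    "00_bronze_to_01_silver_pt2.py",
]

for entry in entries:
    # Print the path each entry would get under the hypothetical layout.
    print(f"src/{classify(entry)}/{entry}")
```

With this split, the layer-to-layer pipelines would end up under src/medallion/ and the inference pipelines under src/standalone/, but I am not sure whether that separation is worth the extra nesting.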
What would be intuitive names for pipelines that start from the Bronze layer but process only one table on a scheduled basis, with different schedules for different Bronze pipelines?
For example:
- 00_bronze_fir_data_pipeline.py (runs daily at 1 AM)
- 00_bronze_tiny_data_pipeline.py (runs daily at 2 AM)
- 00_bronze_huge_data_pipeline.py (runs daily at 4 AM)
Do these naming conventions make sense, or would you suggest a more intuitive approach?
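One alternative I am considering is to keep the schedule out of the file names entirely and describe each Bronze job in a small registry instead. This is only a rough sketch: BronzeJob, BRONZE_JOBS, jobs_due, the table names (taken from the file names above), and the "bronze_<table>_ingest" naming are all hypothetical, and the actual scheduling would still be handled by whatever scheduler runs the jobs.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BronzeJob:
    table: str     # Bronze table the job processes (assumed name)
    run_hour: int  # hour of day (0-23) at which the daily run starts

    @property
    def job_name(self) -> str:
        # Hypothetical convention: layer prefix, table name, then the action.
        return f"bronze_{self.table}_ingest"


# One entry per Bronze table, mirroring the three example schedules.
BRONZE_JOBS = [
    BronzeJob(table="fir_data", run_hour=1),
    BronzeJob(table="tiny_data", run_hour=2),
    BronzeJob(table="huge_data", run_hour=4),
]


def jobs_due(hour: int) -> List[str]:
    """Return the names of the jobs whose daily run starts at the given hour."""
    return [job.job_name for job in BRONZE_JOBS if job.run_hour == hour]


print(jobs_due(2))  # ['bronze_tiny_data_ingest']
```

The idea is that a name only says which layer and table a job touches, while the schedule lives in one place and can change without renaming files.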