Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Best-practice structure for config.yaml, utils, and databricks.yaml in ML project (Databricks)

AnilKumarM
New Contributor

Hi everyone,

I’m working on an ML project in Databricks and want to design a clean, scalable, production-ready project structure. I’d really appreciate guidance from those with real-world experience.

🔹 My Requirement

I need to organize my project with:

- A "config.yaml" file for managing parameters (paths, model configs, environment-specific settings, etc.)
- A "utils" module/package for reusable code (data loading, logging, validation, helpers)
- A "databricks.yaml" file (for Asset Bundles / deployment setup)

🔹 What I’m Looking For

I want to follow industry best practices for:

1. Structuring "config.yaml"
   - How do you separate dev/stage/prod configs?
   - Do you recommend a single config or multiple layered configs?
   - How do you handle secrets (avoid hardcoding)?

2. Designing the "utils" layer
   - What kinds of functions/classes should go here vs. elsewhere?
   - How do you avoid turning it into a “dumping ground”?
   - Any recommended folder structure?

3. Using "databricks.yaml"
   - How should I structure it for multi-environment deployments?
   - What’s the best way to integrate with CI/CD pipelines?
   - How do you manage job definitions and parameters cleanly?

4. Overall project structure
   - An example folder structure for a production-grade ML project in Databricks
   - How do you organize notebooks vs. Python modules?

🔹 Context

- Using Databricks (Asset Bundles / Jobs)
- ML workflow (data preprocessing → training → evaluation → deployment)
- Looking for a scalable, maintainable design (team-collaboration friendly)

🔹 Bonus (if possible)

- A sample repo / GitHub reference
- Common mistakes to avoid

Thanks in advance! I’m especially interested in real-world patterns used in production, not just theoretical suggestions.

1 REPLY

Sumit_7
Honored Contributor

Hey Anil,

I don’t have direct ML experience, but since this question is primarily architectural, here’s my perspective:

1. Keep separate configs per environment:
- base.yaml for shared defaults, plus dev.yaml / stage.yaml / prod.yaml for environment-specific overrides
2. Avoid dumping random helpers or business logic into utils; instead keep small, focused modules:
- logging.py, constants.py, validation.py
3. Use separate workspaces/targets for multi-environment deployments; see the Databricks MLOps Stacks docs: https://docs.databricks.com/aws/en/machine-learning/mlops/mlops-stacks
4. For folder structure, check this: https://academiatoindustry.substack.com/p/why-your-ml-project-looks-like-a
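To make point 1 concrete, here’s a minimal sketch of layered config loading: a base config deep-merged with an environment-specific override. The file names, keys, and values below are placeholders I made up for illustration; in a real project each dict would come from yaml.safe_load on configs/base.yaml and configs/dev.yaml:

```python
# Sketch of layered configs: environment override wins over base defaults.
# The two dicts are inlined so the example is self-contained; in practice
# you would load them with yaml.safe_load from base.yaml and dev.yaml.

def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into base, returning a new dict."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)  # recurse into nested sections
        else:
            merged[key] = value  # override replaces scalar/list values
    return merged

# Placeholder contents of base.yaml
base = {
    "model": {"name": "churn-rf", "n_estimators": 100},
    "paths": {"input": "/mnt/data/raw"},
}

# Placeholder contents of dev.yaml: only the keys that differ in dev
dev_override = {
    "model": {"n_estimators": 10},        # smaller model for fast dev runs
    "paths": {"input": "/mnt/data/dev_sample"},
}

config = deep_merge(base, dev_override)
```

The override file then only lists what actually differs per environment, which keeps dev/stage/prod drift visible in review. Secrets should not live in any of these files; look them up at runtime (e.g., via a Databricks secret scope) instead.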

Hope this helps, thanks.
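As a follow-up to point 3, a minimal databricks.yaml with separate dev/prod targets might look roughly like this; the bundle name and workspace hosts are placeholders, not values from this thread:

```yaml
# Minimal Asset Bundle sketch: one bundle, two deployment targets.
bundle:
  name: ml_project  # placeholder name

targets:
  dev:
    mode: development   # dev-mode behaviors (e.g., resource name prefixes)
    default: true       # used when no --target flag is passed
    workspace:
      host: https://dev-workspace.cloud.databricks.com   # placeholder
  prod:
    mode: production
    workspace:
      host: https://prod-workspace.cloud.databricks.com  # placeholder
```

A CI/CD pipeline can then run `databricks bundle deploy -t dev` or `-t prod`, so the same job definitions are promoted across environments with only the target changing.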