
Best-practice structure for config.yaml, utils, and databricks.yaml in ML project (Databricks)

AnilKumarM
New Contributor

Hi everyone,

 

I’m working on an ML project in Databricks and want to design a clean, scalable, and production-ready project structure. I’d really appreciate guidance from those with real-world experience.

 

🔹 My Requirement

 

I need to organize my project with:

 

- A "config.yaml" file for managing parameters (paths, model configs, environment-specific settings, etc.)

- A "utils" module/package for reusable code (data loading, logging, validation, helpers)

- A "databricks.yaml" file (for asset bundles / deployment setup)

 

🔹 What I’m Looking For

 

I want to follow industry best practices for:

 

1. Structuring "config.yaml"

   

   - How do you separate dev/stage/prod configs?

   - Do you recommend a single config or multiple layered configs?

   - How do you handle secrets (avoid hardcoding)?

 

2. Designing the "utils" layer

   

   - What kind of functions/classes should go here vs elsewhere?

   - How do you avoid making it a “dumping ground”?

   - Any recommended folder structure?

 

3. Using "databricks.yaml"

   

   - How should I structure it for multi-environment deployments?

   - Best way to integrate with CI/CD pipelines?

   - How do you manage job definitions and parameters cleanly?

 

4. Overall project structure

   

   - Example folder structure for a production-grade ML project in Databricks

   - How do you organize notebooks vs Python modules?

 

🔹 Context

 

- Using Databricks (Asset Bundles / Jobs)

- ML workflow (data preprocessing → training → evaluation → deployment)

- Looking for scalable, maintainable design (team collaboration friendly)

 

🔹 Bonus (if possible)

 

- Sample repo / GitHub reference

- Common mistakes to avoid

 

Thanks in advance! I’m especially interested in real-world patterns used in production, not just theoretical suggestions.

3 REPLIES

Sumit_7
Honored Contributor

Hey Anil,

Though I don’t have direct experience in ML, since this question is primarily architectural, here’s my perspective:

1. Keep separate configs per environment:
- base.yaml (shared defaults) plus dev/stage/prod.yaml (environment-specific overrides)
2. Avoid dumping random helpers or business logic into utils; instead keep focused modules:
- logging.py, constants.py, validation.py
3. Use separate workspaces/targets for multi-environment deployments; see the MLOps Stacks docs: https://docs.databricks.com/aws/en/machine-learning/mlops/mlops-stacks
4. For folder structure, check this: https://academiatoindustry.substack.com/p/why-your-ml-project-looks-like-a
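For point 1, the base-plus-override pattern can be sketched in a few lines of Python. This is a minimal illustration, not an official Databricks utility; the file names (conf/base.yaml, conf/dev.yaml, etc.) are hypothetical conventions, and it assumes PyYAML is available (it ships with Databricks Runtime):

```python
# Minimal sketch of a layered config loader. Assumes a hypothetical layout:
#   conf/base.yaml      - shared defaults
#   conf/<env>.yaml     - environment-specific overrides (dev/stage/prod)
from pathlib import Path


def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into base, returning a new dict."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def load_config(env: str, conf_dir: str = "conf") -> dict:
    """Load base.yaml, then overlay <env>.yaml on top of it."""
    import yaml  # PyYAML; deferred import so deep_merge stays dependency-free

    base = yaml.safe_load(Path(conf_dir, "base.yaml").read_text()) or {}
    env_path = Path(conf_dir, f"{env}.yaml")
    override = {}
    if env_path.exists():
        override = yaml.safe_load(env_path.read_text()) or {}
    return deep_merge(base, override)
```

The key design choice is that the env file only needs to list what differs from base, so adding a new environment is a small diff rather than a full copy of the config.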

Hope this helps, thanks.

-werners-
Esteemed Contributor III

IMO there is no such thing as a single best practice, as there are many possibilities.
One may work in one company but not in another.
E.g. we are a small team and use a monorepo + 2 Databricks workspaces with a shared UC metastore.
What we have built here is probably not the way to go for large teams or companies.

One thing that is very important to know beforehand is this:
in case you have multiple workspaces, do they share the same UC metastore or not?
If not, you have to make your code workspace-aware concerning table names (you will have a schema or catalog for dev, qa and prod).
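That workspace-awareness can be as simple as resolving fully qualified table names from the environment. A tiny hypothetical sketch (the "<env>_catalog" naming convention is just an example, not a Databricks standard):

```python
# Sketch: environment-aware three-level Unity Catalog table names,
# assuming a hypothetical convention of one catalog per environment.
def qualified_table(env: str, schema: str, table: str) -> str:
    """Build a name like 'dev_catalog.sales.orders' for the given env."""
    allowed = {"dev", "qa", "prod"}
    if env not in allowed:
        raise ValueError(f"unknown environment: {env!r}")
    return f"{env}_catalog.{schema}.{table}"
```

The point is that code never hard-codes a catalog; the environment is injected once (e.g. from a job parameter or bundle variable) and every table reference goes through one helper.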

Ashwin_DSA
Databricks Employee

Hi @AnilKumarM,

Agree with @-werners- here. There isn’t a single 'one true' repo layout we mandate, but there are a few public references that show the patterns Databricks recommends.

For bundles/databricks.yml + multi‑env, you may want to check the Databricks Asset Bundles (DABs) documentation for the concept and YAML structure as a starting point. The reference provided by @Sumit_7 for MLOps Stacks is also very good: it is an opinionated ML project template built on bundles, including repo layout, bundle config, and CI/CD. You can also look at this to understand how to scaffold a stack project.

Those docs + the repo effectively give you a reference implementation for where databricks.yml lives (project root) and how to define targets/resources... how to separate ML code (src/... and notebooks) from resource YAML (resources/...) and how to structure env‑specific config inside the bundle rather than hard‑coding it.
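To make the targets idea concrete, a skeletal databricks.yml might look like the following. This is an illustrative sketch only; the bundle name and workspace hosts are placeholders you would replace with your own:

```yaml
# Illustrative databricks.yml at the project root (placeholder values).
bundle:
  name: my_ml_project

# Resource definitions (jobs, pipelines, models) live in separate files.
include:
  - resources/*.yml

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://dev-workspace.cloud.databricks.com

  prod:
    mode: production
    workspace:
      host: https://prod-workspace.cloud.databricks.com
```

Deploying with `databricks bundle deploy -t dev` vs `-t prod` then selects the environment, so the same repo serves all targets without duplicated job definitions.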

For CI/CD and repo structure more generally, try this link. It gives some patterns for "single repo with code + bundle config" vs. "separate repos", with concrete examples.

For code vs. utils vs. notebooks, this page walks through putting notebooks in Git, extracting shared code into modules, and testing it.

Taken together, these do not specify your exact config.yaml / utils layout, but they do illustrate the structures Databricks uses internally for production ML projects and how to connect that to databricks.yml and CI/CD.

I hope this provides some guidance.

If this answer resolves your question, could you mark it as “Accept as Solution”? That helps other users quickly find the correct fix.

Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***