Hi everyone,
I’m working on an ML project in Databricks and want to design a clean, scalable, and production-ready project structure. I’d really appreciate guidance from those with real-world experience.
🔹 My Requirements
I need to organize my project with:
- A "config.yaml" file for managing parameters (paths, model configs, environment-specific settings, etc.)
- A "utils" module/package for reusable code (data loading, logging, validation, helpers)
- A "databricks.yml" file (for asset bundles / deployment setup)
🔹 What I’m Looking For
I want to follow industry best practices for:
1. Structuring "config.yaml"
- How do you separate dev/stage/prod configs?
- Do you recommend a single config or multiple layered configs?
- How do you handle secrets (avoid hardcoding)?
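To make the config question concrete, here's a rough sketch of the layered layout I'm imagining — a shared base file plus one override file per environment. All paths, keys, and scope names below are placeholders, and the secret entry is just a pointer to a Databricks secret scope rather than a literal value:

```yaml
# config/dev.yaml — environment overrides layered on top of a shared base
base_config: config/base.yaml

environment: dev

paths:
  raw_data: /Volumes/dev_catalog/ml/raw        # placeholder path
  features: /Volumes/dev_catalog/ml/features   # placeholder path

model:
  name: churn_classifier
  max_depth: 6
  learning_rate: 0.1

# Never a literal secret — only a reference to a secret scope/key
secrets:
  api_token: "{{secrets/dev-scope/api-token}}"
```

Is this base-plus-overrides pattern what people actually use, or do you keep one file with a section per environment?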
2. Designing the "utils" layer
- What kind of functions/classes should go here vs elsewhere?
- How do you avoid making it a “dumping ground”?
- Any recommended folder structure?
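For the utils question, here's roughly the level of granularity I have in mind — small, single-purpose, pipeline-agnostic helpers rather than a grab-bag. The module path and function name are just illustrative:

```python
# src/my_project/utils/validation.py — illustrative helper; names are placeholders
import logging

logger = logging.getLogger(__name__)


def require_columns(rows: list[dict], required: set[str]) -> list[dict]:
    """Fail fast if any row is missing a required column.

    Generic enough to live in utils; anything model-specific
    would sit next to the model code instead.
    """
    for i, row in enumerate(rows):
        missing = required - row.keys()
        if missing:
            raise ValueError(f"row {i} is missing columns: {sorted(missing)}")
    logger.debug("validated %d rows", len(rows))
    return rows
```

My instinct is that utils should only hold code reused by two or more pipelines — does that match how your teams draw the line?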
3. Using "databricks.yml"
- How should I structure it for multi-environment deployments?
- Best way to integrate with CI/CD pipelines?
- How do you manage job definitions and parameters cleanly?
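Here's a minimal sketch of the multi-target bundle layout I'm picturing, based on my reading of the Asset Bundles docs — bundle name, hosts, and job details are all placeholders, so please correct me if this isn't how it's done in practice:

```yaml
# databricks.yml — sketch of a multi-environment bundle (values are placeholders)
bundle:
  name: my_ml_project

targets:
  dev:
    mode: development
    workspace:
      host: https://dev-workspace.cloud.databricks.com   # placeholder host
  prod:
    mode: production
    workspace:
      host: https://prod-workspace.cloud.databricks.com  # placeholder host

resources:
  jobs:
    train_model:
      name: train-model-${bundle.target}
      tasks:
        - task_key: train
          python_wheel_task:
            package_name: my_ml_project
            entry_point: train
```

In CI/CD, I assume the pipeline just runs `databricks bundle deploy -t dev` or `-t prod` per stage — is that the standard approach?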
4. Overall project structure
- Example folder structure for a production-grade ML project in Databricks
- How do you organize notebooks vs Python modules?
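For reference, this is the sort of layout I'm tentatively considering, with notebooks kept thin and all real logic in the installable package — keen to hear whether this matches what teams actually run in production:

```
my_ml_project/
├── databricks.yml          # asset bundle definition
├── config/
│   ├── base.yaml
│   ├── dev.yaml
│   └── prod.yaml
├── src/
│   └── my_ml_project/
│       ├── utils/          # logging, IO, validation helpers
│       ├── preprocessing.py
│       ├── train.py
│       └── evaluate.py
├── notebooks/              # thin entry points that call into src/
├── tests/
└── pyproject.toml
```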
🔹 Context
- Using Databricks (Asset Bundles / Jobs)
- ML workflow (data preprocessing → training → evaluation → deployment)
- Looking for scalable, maintainable design (team collaboration friendly)
🔹 Bonus (if possible)
- Sample repo / GitHub reference
- Common mistakes to avoid
Thanks in advance! I’m especially interested in real-world patterns used in production, not just theoretical suggestions.