
How do you organize ML projects in Databricks workspaces?

Suheb
Contributor

How do you keep your machine-learning files, notebooks, and code properly organized in Databricks?


Louis_Frolio
Databricks Employee

Hey @Suheb, I teach a lot of our machine learning training, and over time I’ve talked with many students, customers, and partners about how they approach this. The answers are all over the map, which tells you there’s no single “golden rule” that fits every team or use case.

That said, Databricks does have a point of view here, and I wanted to share that perspective with you.

Here’s a practical, opinionated plan that keeps machine-learning files, notebooks, and code organized in Databricks using Git folders, MLflow, and Unity Catalog.

1) Organizing principles and workspace layout

  • Use Databricks Git folders to mirror your remote Git repository in the workspace. Clone under each developer’s /Workspace/Users/ and have them work on their own feature branch so they’re not stepping on each other’s changes (a scripted version of this step follows the list).

  • Establish a Production Git folder (admin-owned) for read-only execution and automation. Developers merge via PRs, and automation pulls into production.

  • Keep shared artifacts (dashboards, experiments created outside repos, etc.) in Shared. Keep personal scratch work in Users. Manage all objects in the Workspace browser and set permissions at the folder level.
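
If you want to automate that per-user clone during onboarding, here’s a minimal sketch using the Databricks Python SDK’s Repos API. The repo URL, provider, path, and branch are hypothetical placeholders:

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # reads auth from the environment or ~/.databrickscfg

# Clone the remote repo as a Git folder under the developer's user directory.
repo = w.repos.create(
    url="https://github.com/my-org/ml-project.git",      # hypothetical repo
    provider="gitHub",
    path="/Workspace/Users/dev@example.com/ml-project",  # hypothetical path
)

# Switch the Git folder to a feature branch so work stays off main.
w.repos.update(repo_id=repo.id, branch="feature/my-change")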

2) Repository (Git folder) structure

  • Use a clear, modular repo layout so notebooks don’t become monoliths. A lightweight pattern that works well in Databricks and CI/CD:

 

project-root/
├── notebooks/                 # Exploratory & orchestrating notebooks
│   ├── eda/
│   ├── training/
│   └── inference/
├── src/                       # Reusable Python/R modules imported by notebooks
│   └── mypkg/...
├── tests/                     # Unit/integration tests (pytest, etc.)
├── resources/                 # Bundles YAML: jobs, pipelines, clusters
│   ├── workflows/
│   └── clusters/
├── databricks.yml             # Databricks Asset Bundles definition
├── requirements.txt           # Runtime deps (or pyproject.toml)
└── README.md
  • Keep code you share across notebooks in .py modules under src/ and import them from notebooks for modularity and testability; a minimal import sketch follows.
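
In a notebook that lives inside the Git folder, the import itself is ordinary Python. On recent runtimes the notebook’s working directory is its own folder in the repo, so you can walk up to the repo root and put src/ on sys.path. Module and table names here are hypothetical:

import os
import sys

# Notebook lives at notebooks/training/, so the repo root is two levels up.
repo_root = os.path.abspath(os.path.join(os.getcwd(), "..", ".."))
sys.path.append(os.path.join(repo_root, "src"))

from mypkg.features import build_features  # hypothetical module under src/mypkg

features_df = build_features(spark.table("main.ml.raw_events"))  # hypothetical table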

3) Notebooks vs files: modularization and tests

  • Modularize notebooks by pushing reusable logic into files in the repo and importing them. This enables unit testing and makes code review far cleaner.

  • Add unit/integration tests that can run in notebooks or via the web terminal. Use %run for shared test notebooks when it makes sense, or call modules directly from pytest for the “real” software engineering workflow.
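
For the pytest route, the tests import the same src/ modules directly, so they run identically from the web terminal and in CI. A small sketch against a hypothetical helper in src/mypkg:

# tests/test_features.py: run with `pytest tests/` from the repo root.
import pandas as pd

from mypkg.features import add_ratio_column  # hypothetical function


def test_add_ratio_column():
    df = pd.DataFrame({"clicks": [10, 20], "views": [100, 50]})
    out = add_ratio_column(df, num="clicks", den="views")
    assert out["ratio"].tolist() == [0.1, 0.4]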

4) Experiments, models, and artifacts

  • Track runs in MLflow experiments. For team sharing, prefer workspace experiments stored in a shared workspace folder (not just notebook-scoped experiments); a code sketch covering this and the next two bullets follows the list.

  • Register models in Models in Unity Catalog (MLflow Model Registry integrated with UC) for cross-workspace access, governance, lineage, and aliases. Bonus: the registry UX makes comparisons and promotion workflows much more natural.

  • Be aware of Git folder limitations: workspace MLflow experiments cannot be created inside Git folders. Log to a shared workspace folder and keep notebooks in the repo.
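
Wired together, those three bullets look roughly like this. The APIs are standard MLflow; the experiment path, catalog/schema, and model name are placeholders:

import mlflow
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Target Models in Unity Catalog rather than the legacy workspace registry.
mlflow.set_registry_uri("databricks-uc")

# Shared workspace experiment: lives outside the Git folder, visible to the team.
mlflow.set_experiment("/Shared/ml-project/churn-experiments")  # hypothetical path

X = pd.DataFrame({"clicks": [10, 20, 30, 40], "views": [100, 50, 80, 90]})
y = [0, 1, 0, 1]

with mlflow.start_run(run_name="baseline"):
    model = RandomForestClassifier(max_depth=5).fit(X, y)
    mlflow.log_param("max_depth", 5)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    # Three-level UC name: catalog.schema.model. input_example lets MLflow
    # infer the signature that UC registration requires.
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name="main.ml.churn_model",  # hypothetical
        input_example=X.head(2),
    )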

5) Data and non-tabular files

  • Store data as Unity Catalog tables (for tabular data), and use Unity Catalog volumes for non-tabular files (configs, small sample CSVs, build artifacts, wheels). This keeps everything governed, discoverable, and avoids a jungle of ad hoc mount paths.
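
Volumes surface as stable /Volumes/<catalog>/<schema>/<volume>/ paths, so both Spark and plain Python file APIs can read them. Catalog, schema, volume, and table names below are placeholders:

import json

# Tabular data: a governed Unity Catalog table.
events_df = spark.table("main.ml.training_events")  # hypothetical table

# Non-tabular files: a governed UC volume, addressed like a normal path.
cfg_path = "/Volumes/main/ml/project_files/config/train.json"  # hypothetical
with open(cfg_path) as f:
    cfg = json.load(f)

# Writing small artifacts back to the volume works the same way.
events_df.limit(100).toPandas().to_csv(
    "/Volumes/main/ml/project_files/samples/sample.csv", index=False
)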

6) Permissions and collaboration

  • Manage permissions with folder ACLs. Notebooks and experiments inherit the folder’s permissions. Give edit rights in Users folders, and controlled run/edit rights in Shared and production folders (a scripted sketch follows this list).

  • Collaborate in notebooks with comments, but use folders to enforce consistent permissions across related assets.
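
If you prefer to script those ACLs instead of clicking through the UI, here’s a sketch assuming the Databricks Python SDK’s permissions API; the folder path and group name are hypothetical:

from databricks.sdk import WorkspaceClient
from databricks.sdk.service import iam

w = WorkspaceClient()

# Resolve the folder to its numeric object ID, then set an ACL on it.
folder = w.workspace.get_status("/Workspace/Shared/ml-project")  # hypothetical

w.permissions.set(
    request_object_type="directories",
    request_object_id=str(folder.object_id),
    access_control_list=[
        iam.AccessControlRequest(
            group_name="ml-engineers",  # hypothetical group
            permission_level=iam.PermissionLevel.CAN_EDIT,
        )
    ],
)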

7) CI/CD, environments, and automation

  • Use Databricks Asset Bundles to version jobs, pipelines, clusters, and code together. Validate and deploy across dev/staging/prod with your CI/CD system of choice (GitHub Actions, Azure DevOps, etc.).

  • Maintain separate workspaces (dev/staging/prod) or, at minimum, separate folders and branches mapped to environments. Use PRs and automated checks before promotion.

  • Git folders integrate cleanly with CI/CD. Admins configure production folders and automation pulls merged changes into those folders.

8) Limits and gotchas (plan for them)

  • Git folder limits include working branch size (around 1 GB), per-operation memory/disk limits, and practical caps on total assets. Avoid monorepos in Git folders because performance and operability degrade fast.

  • Incoming changes that alter source code will clear notebook state (outputs, comments, widgets). Align jobs to run from Git commits/tags (not workspace paths) to keep runs deterministic.
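
Pinning a job to a commit or tag is done in the job definition’s Git source. A sketch with the Databricks Python SDK; the repo URL, tag, notebook path, and cluster ID are placeholders:

from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

# The job checks out the repo at a fixed tag on every run, so production
# executes reviewed code regardless of what's in anyone's workspace folder.
w.jobs.create(
    name="train-churn-model",
    git_source=jobs.GitSource(
        git_url="https://github.com/my-org/ml-project.git",  # hypothetical
        git_provider=jobs.GitProvider.GIT_HUB,
        git_tag="v1.2.0",
    ),
    tasks=[
        jobs.Task(
            task_key="train",
            notebook_task=jobs.NotebookTask(
                notebook_path="notebooks/training/train",  # relative to repo root
            ),
            existing_cluster_id="1234-567890-abcde123",  # hypothetical cluster
        )
    ],
)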

9) Naming conventions and paths

  • Use consistent folder paths and naming. Clone repos under /Workspace/Users/<your-email>/<project> and keep environment naming aligned across jobs and bundles.

10) Discovery and lineage (optional but recommended)

  • Lean on search for notebooks, repos, volumes, and UC models/tables across the workspace. This helps teams find “the right thing” without tribal knowledge.

  • Use Unity Catalog lineage for column-level provenance across notebooks, jobs, and dashboards. It pays dividends during reviews, audits, and “why did this model change?” conversations.
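
Lineage is also queryable if you’d rather script it than browse the UI; a sketch assuming the system.access lineage system tables are enabled in your metastore, with a hypothetical target table:

# Recent reads/writes feeding a downstream feature table.
lineage_df = spark.sql("""
    SELECT source_table_full_name,
           entity_type,
           entity_id,
           event_time
    FROM system.access.table_lineage
    WHERE target_table_full_name = 'main.ml.churn_features'
    ORDER BY event_time DESC
    LIMIT 20
""")
display(lineage_df)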

Why this plan works in Databricks

  • Git folders bring Git operations directly into the workspace and create a clean collaboration model (clone per-user, feature branches, PRs, controlled promotion).

  • It supports real software engineering practices without pretending notebooks should be your entire codebase: modularity, testing, reviewability, CI/CD.

  • Asset Bundles unify infra + code promotion, while MLflow centralizes runs and UC Model Registry centralizes governed models. Net result: ML development that behaves like a disciplined software lifecycle.

Quick checklist you can give teams

  • Clone your repo as a Git folder in your user directory. Create a feature branch and PR your changes.

  • Put shared code in src/, notebooks in notebooks/, tests in tests/, and environment configs/workflows in resources/ with databricks.yml managed by Bundles.

  • Log runs to shared MLflow experiments, register models in UC Model Registry, and store non-tabular artifacts in UC volumes.

  • Use folder ACLs for permissions. Avoid monorepos and keep repo sizes within Git folder limits.

 

Hope this helps.

Cheers, Louis.