Hey @Suheb, I teach a lot of our machine learning training, and over time I've talked with many students, customers, and partners about how they approach this. The answers are all over the map, which tells you there's no single "golden rule" that fits every team or use case.
That said, Databricks does have a point of view here, and I wanted to share that perspective with you.
Here's a practical, opinionated plan you can suggest that keeps machine-learning files, notebooks, and code organized in Databricks using Git folders, MLflow, and Unity Catalog.
1) Organizing principles and workspace layout
- Use Databricks Git folders to mirror your remote Git repository in the workspace. Clone under each developer's /Workspace/Users/ directory and have them work on their own feature branch so they're not stepping on each other's changes (a scripted sketch follows this list).
- Establish a Production Git folder (admin-owned) for read-only execution and automation. Developers merge via PRs, and automation pulls into production.
- Keep shared artifacts (dashboards, experiments created outside repos, etc.) in Shared. Keep personal scratch work in Users. Manage all objects in the Workspace browser and set permissions at the folder level.
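If you want to script the per-user clone, here's a minimal sketch using the Databricks Python SDK. The repo URL, workspace path, and branch name are placeholders, and it's worth verifying path conventions against the current Repos API docs for your workspace:

```python
# Sketch: clone a Git folder into a user's workspace path and switch to a
# feature branch with the Databricks Python SDK (pip install databricks-sdk).
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # picks up auth from env vars or ~/.databrickscfg

repo = w.repos.create(
    url="https://github.com/acme/ml-project.git",       # placeholder repo
    provider="gitHub",
    path="/Workspace/Users/jane@acme.com/ml-project",   # per-user clone
)

# Work on a feature branch so developers don't step on each other.
w.repos.update(repo_id=repo.id, branch="feature/better-eda")
```

The same `repos.update` call is what automation can use after a merge, pointing a production Git folder back at the latest commit on main.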
2) Repository (Git folder) structure
```
project-root/
├── notebooks/          # Exploratory & orchestrating notebooks
│   ├── eda/
│   ├── training/
│   └── inference/
├── src/                # Reusable Python/R modules imported by notebooks
│   └── mypkg/...
├── tests/              # Unit/integration tests (pytest, etc.)
├── resources/          # Bundles YAML: jobs, pipelines, clusters
│   ├── workflows/
│   └── clusters/
├── databricks.yml      # Databricks Asset Bundles definition
├── requirements.txt    # Runtime deps (or pyproject.toml)
└── README.md
```
3) Notebooks vs files: modularization and tests
- Modularize notebooks by pushing reusable logic into files in the repo and importing them. This enables unit testing and makes code review far cleaner.
- Add unit/integration tests that can run in notebooks or via the web terminal. Use %run for shared test notebooks when it makes sense, or call modules directly from pytest for the "real" software engineering workflow (a sketch follows this list).
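As a concrete sketch of that split, here's a hypothetical module under src/ and a pytest test for it. The package and function names are made up, and the test assumes src/ is importable (for example via pytest's pythonpath setting or an editable install):

```python
# src/mypkg/features.py: reusable logic lives in plain Python modules.
import numpy as np
import pandas as pd

def add_ratio_feature(df: pd.DataFrame, num: str, den: str, out: str = "ratio") -> pd.DataFrame:
    """Return a copy of df with num/den as a new column; zero denominators become NaN."""
    result = df.copy()
    result[out] = result[num] / result[den].replace(0, np.nan)
    return result


# tests/test_features.py: run with `pytest tests/` from the repo root,
# locally, in CI, or from the Databricks web terminal.
def test_zero_denominator_yields_nan():
    df = pd.DataFrame({"a": [1.0, 2.0], "b": [2.0, 0.0]})
    out = add_ratio_feature(df, "a", "b")
    assert out["ratio"].iloc[0] == 0.5
    assert pd.isna(out["ratio"].iloc[1])
```

A notebook in notebooks/training/ then imports the same function; with a src/ layout you may need to append the src directory to sys.path or make mypkg installable.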
4) Experiments, models, and artifacts
- Track runs in MLflow experiments. For team sharing, prefer workspace experiments stored in a shared workspace folder, not just notebook-scoped experiments.
- Register models in Models in Unity Catalog (the MLflow Model Registry integrated with UC) for cross-workspace access, governance, lineage, and aliases. Bonus: the registry UX makes comparisons and promotion workflows much more natural.
- Be aware of Git folder limitations: workspace MLflow experiments cannot be created inside Git folders, so log to a shared workspace folder and keep the notebooks themselves in the repo (see the sketch after this list).
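Putting those pieces together, here's a hedged sketch: a run logged to a shared workspace experiment, with the model registered in Unity Catalog. The experiment path, catalog/schema names, and version number are placeholders; note that UC-registered models require a model signature:

```python
import mlflow
import numpy as np
from mlflow import MlflowClient
from mlflow.models import infer_signature
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

mlflow.set_registry_uri("databricks-uc")               # UC-backed model registry
mlflow.set_experiment("/Shared/ml-experiments/churn")  # shared folder, outside the Git folder

# Toy data and model so the sketch is self-contained.
X = np.random.rand(200, 4)
y = (X[:, 0] > 0.5).astype(int)

with mlflow.start_run(run_name="baseline-rf"):
    model = RandomForestClassifier(n_estimators=200).fit(X, y)
    mlflow.log_param("n_estimators", 200)
    mlflow.log_metric("auc", roc_auc_score(y, model.predict_proba(X)[:, 1]))
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        signature=infer_signature(X, model.predict(X)),    # required for UC models
        registered_model_name="main.ml_team.churn_model",  # catalog.schema.model
    )

# Promote with an alias instead of hard-coded stages (version 1 is illustrative).
MlflowClient().set_registered_model_alias("main.ml_team.churn_model", "champion", 1)
```

Downstream code can then load "models:/main.ml_team.churn_model@champion", so promotion is just moving the alias.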
5) Data and non-tabular files
- Store tabular data as Unity Catalog tables, and use Unity Catalog volumes for non-tabular files (configs, small sample CSVs, build artifacts, wheels). This keeps everything governed and discoverable, and avoids a jungle of ad hoc mount paths; a short sketch follows.
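For instance, assuming a Databricks notebook where spark is predefined and the volume already exists (all names are placeholders):

```python
import os

# Tabular data: a governed, three-level-namespace UC table.
df = spark.range(5).withColumnRenamed("id", "customer_id")
df.write.mode("overwrite").saveAsTable("main.ml_team.customers_sample")

# Non-tabular files: UC volumes appear as regular paths under /Volumes.
config_path = "/Volumes/main/ml_team/artifacts/configs/train.json"
os.makedirs(os.path.dirname(config_path), exist_ok=True)
with open(config_path, "w") as f:
    f.write('{"n_estimators": 200}')
```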
6) Permissions and collaboration
- Manage permissions with folder ACLs. Notebooks and experiments inherit the folder's permissions. Grant edit rights in Users folders and controlled run/edit rights in Shared and production folders (a scripted sketch follows this list).
- Collaborate in notebooks with comments, but use folders to enforce consistent permissions across related assets.
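Folder ACLs are usually set in the UI, but they can be scripted too. This is a rough sketch with the Databricks Python SDK; the folder path and group name are placeholders, and the exact request classes are worth double-checking against current SDK docs:

```python
# Sketch: grant a group run access on a shared folder so child notebooks
# and experiments inherit it. Note that permissions.set replaces the ACL.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import iam

w = WorkspaceClient()
folder = w.workspace.get_status("/Shared/ml-project")  # placeholder path

w.permissions.set(
    request_object_type="directories",
    request_object_id=str(folder.object_id),
    access_control_list=[
        iam.AccessControlRequest(
            group_name="ml-engineers",  # placeholder group
            permission_level=iam.PermissionLevel.CAN_RUN,
        ),
    ],
)
```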
7) CI/CD, environments, and automation
- Use Databricks Asset Bundles to version jobs, pipelines, clusters, and code together. Validate and deploy across dev/staging/prod using your cloud's CI/CD system (GitHub Actions, Azure DevOps, etc.).
- Maintain separate workspaces (dev/staging/prod) or, at minimum, separate folders and branches mapped to environments. Use PRs and automated checks before promotion.
- Git folders integrate cleanly with CI/CD: admins configure production folders, and automation pulls merged changes into them.
8) Limits and gotchas (plan for them)
- Git folder limits include working branch size (around 1 GB), per-operation memory/disk limits, and practical caps on total assets. Avoid monorepos in Git folders; performance and operability degrade fast.
- Incoming changes that alter source code will clear notebook state (outputs, comments, widgets). Point jobs at Git commits/tags rather than workspace paths to keep runs deterministic (a sketch follows this list).
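Here's a hedged sketch of that last point using the Databricks Python SDK's jobs service: the job's source is a Git tag, so every run executes the same committed code. The URL, tag, and cluster ID are placeholders, and field names are worth verifying against current SDK docs:

```python
# Sketch: a job whose source is a pinned Git tag, not a workspace path.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()
w.jobs.create(
    name="train-churn-model",
    git_source=jobs.GitSource(
        git_url="https://github.com/acme/ml-project.git",  # placeholder repo
        git_provider=jobs.GitProvider.GIT_HUB,
        git_tag="v1.2.0",                                  # pin to a release tag
    ),
    tasks=[
        jobs.Task(
            task_key="train",
            notebook_task=jobs.NotebookTask(
                notebook_path="notebooks/training/train",  # path inside the repo
                source=jobs.Source.GIT,
            ),
            existing_cluster_id="0123-456789-abcdefgh",    # placeholder cluster
        ),
    ],
)
```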
9) Naming conventions and paths
- Adopt consistent, descriptive names for experiments, models, and jobs (for example, a shared team/project prefix), and keep workspace paths mirroring the repo layout so assets stay easy to find.
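One illustrative convention (the names and scheme here are entirely hypothetical; adapt to your org):

```python
# Illustrative only: derive consistent asset names from one project prefix
# so related experiments, models, and jobs sort together and are searchable.
def asset_names(team: str, project: str, env: str) -> dict:
    prefix = f"{team}_{project}"
    return {
        "experiment": f"/Shared/ml-experiments/{team}/{project}-{env}",
        "uc_model": f"main.{team}.{project}_model",  # catalog.schema.model
        "job": f"{prefix}-{env}-train",
    }

print(asset_names("ml_team", "churn", "dev"))
```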
10) Discovery and lineage (optional but recommended)
- Lean on workspace search for notebooks, repos, volumes, and UC models/tables. It helps teams find "the right thing" without tribal knowledge.
- Use Unity Catalog lineage for column-level provenance across notebooks, jobs, and dashboards. It pays dividends during reviews, audits, and "why did this model change?" conversations (see the query sketch after this list).
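If lineage system tables are enabled and readable in your account, you can query them directly from a notebook. The system table below is the documented one; the filter value is a placeholder:

```python
# Sketch: recent upstream/downstream lineage events for one table.
lineage = spark.sql("""
    SELECT source_table_full_name, target_table_full_name, entity_type, event_time
    FROM system.access.table_lineage
    WHERE target_table_full_name = 'main.ml_team.customers_sample'
    ORDER BY event_time DESC
    LIMIT 20
""")
display(lineage)
```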
Why this plan works in Databricks
- Git folders bring Git operations directly into the workspace and create a clean collaboration model (clone per user, feature branches, PRs, controlled promotion).
- The plan supports real software engineering practices (modularity, testing, reviewability, CI/CD) without pretending notebooks should be your entire codebase.
- Asset Bundles unify infra and code promotion, while MLflow centralizes runs and the UC Model Registry centralizes governed models. Net result: ML development that behaves like a disciplined software lifecycle.
Quick checklist you can give teams
- Clone your repo as a Git folder in your user directory. Create a feature branch and PR your changes.
- Put shared code in src/, notebooks in notebooks/, tests in tests/, and environment configs/workflows in resources/, with databricks.yml managed by Bundles.
- Log runs to shared MLflow experiments, register models in the UC Model Registry, and store non-tabular artifacts in UC volumes.
- Use folder ACLs for permissions. Avoid monorepos and keep repo sizes within Git folder limits.
Hope this helps.
Cheers, Louis.