Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

CICD Folder structure for team of 10 Members

naveenbandla
New Contributor

Hi Everyone,

We are in the process of setting up a CI/CD framework for our Databricks ecosystem, and I have a general question around best practices.

We are a team of 10 members, and I’m trying to understand the ideal way to structure our repository and Databricks assets. I’ve gone through several blog posts, but I’m seeing mixed approaches.

Specifically:

  • Should we maintain a single top-level databricks.yml and deploy everything for every change?

  • Or is it better to organize assets project-wise (or domain-wise), each with its own configuration, so changes are scoped only to the relevant project?

I’d like to understand what is generally followed across companies and what has worked well in practice for scalability, collaboration, and controlled deployments.

Looking forward to your inputs and recommendations.

Thanks!

2 REPLIES

pradeep_singh
Contributor

 

If the work is owned by a single team, you can use one databricks.yml. Each team member develops and tests their own resources locally, then commits to Git. At deployment time you can deploy everything together, or let your CI pipeline scope deployments to what changed. In development mode, resource names are automatically prefixed with the deploying user's name, which prevents naming conflicts across teammates, so a single databricks.yml is both safe and simple.

 
 
Thank You
Pradeep Singh - https://www.linkedin.com/in/dbxdev

SteveOstrowski
Databricks Employee

Hi @naveenbandla,

This is a common decision point when adopting Databricks Asset Bundles (DABs), and the answer depends on how closely coupled your team's work is. Here is a breakdown of the two main patterns and when each works best.

OPTION 1: SINGLE REPO, SINGLE BUNDLE (MONOLITH)

Use one databricks.yml at the root with all resources defined (or split across included files).

When it works well:
- Your team of 10 shares a common domain (e.g., one data platform team)
- Resources have cross-dependencies (e.g., jobs that reference shared pipelines or libraries)
- You want a single deployment artifact per environment

A typical folder structure looks like:

my-project/
  databricks.yml
  resources/
    jobs/
      ingest_job.yml
      transform_job.yml
    pipelines/
      bronze_pipeline.yml
      silver_pipeline.yml
  src/
    notebooks/
      ingest.py
      transform.py
    python/
      shared_utils/
        __init__.py
        helpers.py
  tests/
    unit/
    integration/

Key points:
- Use the "include" mapping in databricks.yml to split resource definitions across multiple YAML files so the root file stays clean:

bundle:
  name: my-project

include:
  - "resources/jobs/*.yml"
  - "resources/pipelines/*.yml"

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://your-dev-workspace.cloud.databricks.com
  staging:
    workspace:
      host: https://your-staging-workspace.cloud.databricks.com
  prod:
    mode: production
    workspace:
      host: https://your-prod-workspace.cloud.databricks.com
    run_as:
      service_principal_name: "cicd-service-principal"

- In "development" mode, DABs automatically prefixes all deployed resources with [dev <your_username>], so all 10 team members can deploy simultaneously without naming collisions.
- Note that "databricks bundle deploy -t dev" applies the whole bundle for the chosen target; the CLI does not offer a per-resource deploy flag, so if you need deployments scoped to individual changes, that is one reason to consider Option 2 below.
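The day-to-day development loop with a single bundle can be sketched as follows. This is a sketch, not an official workflow: "ingest_job" is a hypothetical job key, and by default the commands are only printed rather than executed (set DRY_RUN=0 to actually invoke a configured Databricks CLI):

```shell
# Wrapper that prints each command instead of running it unless DRY_RUN=0,
# so the sketch is runnable even without a configured Databricks CLI.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "would run: $*"; else "$@"; fi; }

run databricks bundle validate -t dev        # catch configuration errors early
run databricks bundle deploy -t dev          # resources get the "[dev <username>]" prefix
run databricks bundle run -t dev ingest_job  # smoke-test one job; the job key is hypothetical
```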

OPTION 2: SINGLE REPO, MULTIPLE BUNDLES (DOMAIN/PROJECT SPLIT)

Each project or domain gets its own subdirectory with its own databricks.yml. This is the recommended approach when teams or projects are more independent.

repo-root/
  project-a/
    databricks.yml
    src/
    resources/
    tests/
  project-b/
    databricks.yml
    src/
    resources/
    tests/
  shared-libs/
    python/
      common_utils/

When it works well:
- Different team members own different projects or domains
- You want changes scoped to only the affected project (faster deploys, smaller blast radius)
- Projects have different deployment cadences or target different workspaces

In your CI/CD pipeline (GitHub Actions, Azure DevOps, etc.), you can detect which subdirectory changed and only deploy that bundle:

# GitHub Actions example (simplified)
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: dorny/paths-filter@v3
        id: changes
        with:
          filters: |
            project-a:
              - 'project-a/**'
            project-b:
              - 'project-b/**'
      - if: steps.changes.outputs.project-a == 'true'
        run: |
          cd project-a
          databricks bundle deploy -t prod
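If you are not on GitHub Actions, the same change detection can be done in plain shell. A minimal sketch, assuming each top-level directory containing a databricks.yml is a bundle; in real CI the file list would come from "git diff --name-only origin/main...HEAD", and the deploy command is only echoed here:

```shell
# Example list of changed files; in CI this would come from git diff.
changed_files="project-a/src/ingest.py
project-b/resources/job.yml
project-a/tests/test_ingest.py"

# Reduce the file paths to the unique set of top-level directories,
# then deploy each one that actually contains a bundle.
for dir in $(printf '%s\n' "$changed_files" | cut -d/ -f1 | sort -u); do
  if [ -f "$dir/databricks.yml" ]; then
    echo "deploying bundle: $dir"  # real CI: (cd "$dir" && databricks bundle deploy -t prod)
  fi
done
```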

RECOMMENDATION FOR A TEAM OF 10

For most teams of this size, a hybrid approach works well:

1. Start with a single bundle if the team shares one domain. The "include" feature keeps things modular, and dev mode prevents conflicts.

2. Split into separate bundles per project when you notice that unrelated changes are triggering full redeployments, or when sub-teams form around distinct workloads.

3. Use custom bundle templates to standardize folder structure across all projects. You can create a template and have every team member initialize new projects from it:

databricks bundle init /path/to/your/team-template

This ensures consistent naming, testing structure, and CI/CD configuration across all 10 members.
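For point 3, a custom template is a directory containing a databricks_template_schema.json plus a templated project skeleton. A hypothetical layout (the directory and variable names here are illustrative, not prescribed):

```
team-template/
  databricks_template_schema.json    (defines input variables such as project_name)
  template/
    {{.project_name}}/
      databricks.yml.tmpl
      resources/
      src/
      tests/
```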

ADDITIONAL BEST PRACTICES

- Use service principals for production deployments. Never deploy to prod with personal credentials.
- Set "mode: production" on your prod target. This enforces validations like requiring run_as to be set and disabling cluster overrides.
- Use Git branch validation in your prod target to ensure only the main branch can deploy to production.
- Keep shared Python libraries in a dedicated folder and reference them via the "libraries" mapping in your job definitions.
- Use "databricks bundle validate" in your CI pipeline as a pre-merge check to catch configuration errors early.
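As an example of the last point, a minimal pre-merge check in GitHub Actions might look like the following sketch (the workflow name, target, and secret names are assumptions; databricks/setup-cli is the CLI setup action listed in the references below):

```yaml
name: bundle-validate
on: [pull_request]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: databricks/setup-cli@main
      - run: databricks bundle validate -t dev
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
```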

DOCUMENTATION REFERENCES

- Databricks Asset Bundles overview: https://docs.databricks.com/aws/en/dev-tools/bundles/
- Bundle configuration (databricks.yml): https://docs.databricks.com/aws/en/dev-tools/bundles/settings.html
- CI/CD with Databricks Asset Bundles: https://docs.databricks.com/aws/en/dev-tools/bundles/ci-cd.html
- Deployment modes (dev vs production): https://docs.databricks.com/aws/en/dev-tools/bundles/deployment-modes.html
- Custom bundle templates: https://docs.databricks.com/aws/en/dev-tools/bundles/templates.html
- GitHub Actions for Databricks: https://github.com/databricks/setup-cli

* This reply was drafted with an agent system I built, which researches and drafts responses from the documentation I have available and from previous memory. I personally review each draft for obvious issues and to monitor the system's reliability, and I update it when I detect drift, but there is still a small chance something is inaccurate, especially if you are experimenting with brand-new features.

If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.