To properly version and deploy Databricks workflows—including orchestration, dependencies, and environment management—across Dev, Test, and Prod using Azure DevOps, follow these best practices and patterns:
Versioning Databricks Workflows
- Store Databricks notebooks, scripts, and workflow configuration (such as Databricks Asset Bundles defined in databricks.yml) in a Git repository.
- Use structured folders: separate directories for notebooks, scripts, configuration files, and bundle definitions.
- Commit each change (workflow logic, dependencies, orchestration configs) to version control, and use Git branches for environment-specific workflows and stable releases.
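As an illustration, a repository might be laid out like this (folder and file names are examples, not a required convention):
my-databricks-project/
  databricks.yml          (bundle definition: jobs, clusters, targets)
  azure-pipelines.yml     (build/release pipeline definitions)
  notebooks/
    my_notebook.py
  src/
    my_package/           (custom Python package)
  tests/
    test_my_package.py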
Automated CI/CD with Azure DevOps
Core Approach:
- Use Databricks Asset Bundles (DABs) and the Databricks CLI to define and deploy jobs and their orchestration as code.
- Set up two separate pipelines in Azure DevOps: a build pipeline (prepares artifacts) and a release pipeline (deploys and runs workflows).
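The release pipeline needs the Databricks CLI available on the agent; a minimal sketch of an install step, using the install script published in the databricks/setup-cli repository (verify the URL against current Databricks documentation):
steps:
  # Install the Databricks CLI on the build agent (requires outbound internet access)
  - script: |
      curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
      databricks --version
    displayName: "Install Databricks CLI"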
Steps for End-to-End CI/CD
1. Organize Assets in Git
- Put all notebooks, custom scripts/packages, and databricks.yml in your repository.
- The databricks.yml bundle file describes jobs, dependencies, clusters, variables, and deployment targets (Dev, Test, Prod) in declarative YAML.
2. Define Build Pipeline
- The pipeline pulls the latest sources from Git, runs tests, packages any libraries (such as Python wheels; see the sketch at the end of this step), and publishes a deployment artifact.
- Example pipeline YAML (simplified):
trigger:
  - release

pool:
  vmImage: ubuntu-latest

steps:
  - checkout: self

  # Copy the repository contents (notebooks, scripts, databricks.yml) into the staging directory
  - script: |
      mkdir -p $(Build.ArtifactStagingDirectory)
      cp -R * $(Build.ArtifactStagingDirectory)/
    displayName: "Prepare Artifacts"

  # Publish the staged files as a named build artifact
  - task: PublishBuildArtifacts@1
    inputs:
      PathtoPublish: '$(Build.ArtifactStagingDirectory)'
      ArtifactName: 'DatabricksBuild'
- Store the pipeline YAML in the repository so it is versioned alongside the code.
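If the workflow depends on a custom Python package, the build pipeline can also produce a wheel before publishing the artifact; a hedged sketch, assuming a hypothetical package under src/my_package with its own pyproject.toml:
  # Build a wheel from the hypothetical package and stage it with the other artifacts
  - script: |
      python -m pip install --upgrade build
      python -m build --wheel --outdir $(Build.ArtifactStagingDirectory)/dist src/my_package
    displayName: "Build Python Wheel"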
3. Define Release Pipeline
- Downloads the build artifact and deploys it using the Databricks CLI.
- Uses pipeline or environment variables to switch between deployment targets, for example:
databricks bundle deploy -t dev
databricks bundle deploy -t test
databricks bundle deploy -t prod
- Runs Databricks jobs after deployment for validation or smoke tests.
- Configures service principal credentials securely for Databricks API access (see the sketch below).
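A minimal sketch of the deployment and smoke-test steps, assuming the Databricks CLI is installed on the agent, a pipeline variable targetEnv selects the environment, and the service principal credentials are stored as secret variables named DATABRICKS_HOST, DATABRICKS_CLIENT_ID, and DATABRICKS_CLIENT_SECRET (these names and the job key my-job are examples):
steps:
  # Deploy the bundle to the selected target using service-principal (OAuth) credentials
  - script: databricks bundle deploy -t $(targetEnv)
    displayName: "Deploy Databricks Bundle"
    env:
      DATABRICKS_HOST: $(DATABRICKS_HOST)
      DATABRICKS_CLIENT_ID: $(DATABRICKS_CLIENT_ID)
      DATABRICKS_CLIENT_SECRET: $(DATABRICKS_CLIENT_SECRET)

  # Run the deployed job once as a smoke test
  - script: databricks bundle run my-job -t $(targetEnv)
    displayName: "Run Smoke Test Job"
    env:
      DATABRICKS_HOST: $(DATABRICKS_HOST)
      DATABRICKS_CLIENT_ID: $(DATABRICKS_CLIENT_ID)
      DATABRICKS_CLIENT_SECRET: $(DATABRICKS_CLIENT_SECRET)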
4. Promote Artifacts Between Environments
- Artifacts tested in Dev are promoted to Test and then Prod by reusing the same release pipeline with different target parameters, ensuring consistency and immutability (see the parameterized example below).
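One common way to reuse a single pipeline definition across environments is a runtime parameter that selects the bundle target; a minimal sketch, assuming a parameter named targetEnv:
parameters:
  - name: targetEnv
    displayName: "Deployment target"
    type: string
    default: dev
    values:
      - dev
      - test
      - prod

steps:
  # The same deployment steps run for every environment; only the target changes
  - script: databricks bundle deploy -t ${{ parameters.targetEnv }}
    displayName: "Deploy bundle to ${{ parameters.targetEnv }}"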
Workflow Configuration Example (databricks.yml)
- Example bundle for jobs and dependencies:
bundle:
  name: my-workflow

targets:
  dev:
    mode: development
    workspace:
      host: https://adb-xxx.azuredatabricks.net
  prod:
    mode: production
    workspace:
      host: https://adb-yyy.azuredatabricks.net

resources:
  jobs:
    my-job:
      name: My Workflow Job
      tasks:
        - task_key: main              # each task needs a unique key
          notebook_task:
            notebook_path: ./notebooks/my_notebook.py
          new_cluster:
            spark_version: "13.3.x-scala2.12"
            node_type_id: Standard_DS3_v2
- Switch the deployment environment via a CLI flag or DevOps pipeline variable: databricks bundle deploy -t prod.
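As a hedged extension of the example above, bundle variables can keep environment-specific settings (such as node types) out of the job definition itself; the variable name node_type below is illustrative:
variables:
  node_type:
    description: Cluster node type for the current environment
    default: Standard_DS3_v2

targets:
  dev:
    variables:
      node_type: Standard_DS3_v2
  prod:
    variables:
      node_type: Standard_DS4_v2

# In the job definition, reference the variable instead of a literal value:
#   new_cluster:
#     node_type_id: ${var.node_type}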
Key Tools and Tips
- Use the Databricks CLI in non-interactive mode within pipelines for deployments and validation.
- Parameterize cluster, node, and job settings per environment in the bundle YAML (see the variables example above).
- Test locally with the CLI and validate your bundle syntax before running it through the pipeline (see the commands after this list).
- Store secrets (Azure Databricks tokens, service principal credentials) securely using Azure DevOps secret variables or variable groups.
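For local checks, the CLI's bundle validation command catches syntax and configuration errors before anything reaches the pipeline; for example, assuming you are already authenticated against the Dev workspace and using the job key my-job from the example above:
databricks bundle validate -t dev
databricks bundle deploy -t dev
databricks bundle run my-job -t dev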
This approach offers clear, reproducible promotion of workflow definitions, orchestration, dependencies, and environment settings, fully automated within Azure DevOps and under version control.
For a detailed Microsoft walkthrough with sample files and pipeline YAML, see the official documentation.