To properly version and deploy Databricks workflows—including orchestration, dependencies, and environment management—across Dev, Test, and Prod using Azure DevOps, follow these best practices and patterns:
Versioning Databricks Workflows
- Store Databricks notebooks, scripts, and workflow configuration (such as Databricks Asset Bundles defined in databricks.yml) in a Git repository.
- Use structured folders: separate directories for notebooks, scripts, configuration files, and bundle definitions.
- Commit each change (workflow logic, dependencies, orchestration configs) to version control, and use Git branches for environment-specific workflows and stable releases.
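As an illustration, a repository might be laid out like this (folder and file names are examples, not a required convention):
my-databricks-project/
  databricks.yml          (bundle definition: jobs, clusters, targets)
  azure-pipelines.yml     (build/release pipeline definitions)
  notebooks/
    my_notebook.py
  src/
    my_package/           (custom Python package)
  tests/
    test_my_package.py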
Automated CI/CD with Azure DevOps
Core Approach:
- Use Databricks Asset Bundles (DABs) and the Databricks CLI to define and deploy jobs and their orchestration as code.
- Set up two separate pipelines in Azure DevOps: a build pipeline (prepares artifacts) and a release pipeline (deploys and runs workflows).
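The release pipeline needs the Databricks CLI available on the agent; a minimal sketch of an install step, using the install script published in the databricks/setup-cli repository (verify the URL against current Databricks documentation):
steps:
  # Install the Databricks CLI on the build agent (requires outbound internet access)
  - script: |
      curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
      databricks --version
    displayName: "Install Databricks CLI"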
Steps for End-to-End CI/CD
1. Organize Assets in Git
- Put all notebooks, custom scripts/packages, and databricks.yml in your repository.
- The databricks.yml bundle file describes jobs, dependencies, clusters, variables, and deployment targets (Dev, Test, Prod) in declarative YAML.
2. Define Build Pipeline
- The pipeline pulls the latest sources from Git, runs tests, packages any libraries (such as Python wheels; see the sketch at the end of this step), and publishes a deployment artifact.
- Example pipeline YAML (simplified):
trigger:
  - release

pool:
  vmImage: ubuntu-latest

steps:
  - checkout: self

  # Copy the repository contents (notebooks, scripts, databricks.yml) into the staging directory
  - script: |
      mkdir -p $(Build.ArtifactStagingDirectory)
      cp -R * $(Build.ArtifactStagingDirectory)/
    displayName: "Prepare Artifacts"

  # Publish the staged files as a named build artifact
  - task: PublishBuildArtifacts@1
    inputs:
      PathtoPublish: '$(Build.ArtifactStagingDirectory)'
      ArtifactName: 'DatabricksBuild'
- Store the pipeline YAML in the repository so it is versioned alongside the code.
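If the workflow depends on a custom Python package, the build pipeline can also produce a wheel before publishing the artifact; a hedged sketch, assuming a hypothetical package under src/my_package with its own pyproject.toml:
  # Build a wheel from the hypothetical package and stage it with the other artifacts
  - script: |
      python -m pip install --upgrade build
      python -m build --wheel --outdir $(Build.ArtifactStagingDirectory)/dist src/my_package
    displayName: "Build Python Wheel"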
3. Define Release Pipeline
- Downloads the build artifact and deploys it using the Databricks CLI.
- Uses pipeline or environment variables to switch between deployment targets, for example:
databricks bundle deploy -t dev
databricks bundle deploy -t test
databricks bundle deploy -t prod
- Runs Databricks jobs after deployment for validation or smoke tests.
- Configures service principal credentials securely for Databricks API access (see the sketch below).
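A minimal sketch of the deployment and smoke-test steps, assuming the Databricks CLI is installed on the agent, a pipeline variable targetEnv selects the environment, and the service principal credentials are stored as secret variables named DATABRICKS_HOST, DATABRICKS_CLIENT_ID, and DATABRICKS_CLIENT_SECRET (these names and the job key my-job are examples):
steps:
  # Deploy the bundle to the selected target using service-principal (OAuth) credentials
  - script: databricks bundle deploy -t $(targetEnv)
    displayName: "Deploy Databricks Bundle"
    env:
      DATABRICKS_HOST: $(DATABRICKS_HOST)
      DATABRICKS_CLIENT_ID: $(DATABRICKS_CLIENT_ID)
      DATABRICKS_CLIENT_SECRET: $(DATABRICKS_CLIENT_SECRET)

  # Run the deployed job once as a smoke test
  - script: databricks bundle run my-job -t $(targetEnv)
    displayName: "Run Smoke Test Job"
    env:
      DATABRICKS_HOST: $(DATABRICKS_HOST)
      DATABRICKS_CLIENT_ID: $(DATABRICKS_CLIENT_ID)
      DATABRICKS_CLIENT_SECRET: $(DATABRICKS_CLIENT_SECRET)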
4. Promote Artifacts Between Environments
- Artifacts tested in Dev are promoted to Test and then Prod by reusing the same release pipeline with different target parameters, ensuring consistency and immutability (see the parameterized example below).
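One common way to reuse a single pipeline definition across environments is a runtime parameter that selects the bundle target; a minimal sketch, assuming a parameter named targetEnv:
parameters:
  - name: targetEnv
    displayName: "Deployment target"
    type: string
    default: dev
    values:
      - dev
      - test
      - prod

steps:
  # The same deployment steps run for every environment; only the target changes
  - script: databricks bundle deploy -t ${{ parameters.targetEnv }}
    displayName: "Deploy bundle to ${{ parameters.targetEnv }}"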
Workflow Configuration Example (databricks.yml)
- Example bundle for jobs and dependencies:
bundle:
  name: my-workflow

targets:
  dev:
    mode: development
    workspace:
      host: https://adb-xxx.azuredatabricks.net
  prod:
    mode: production
    workspace:
      host: https://adb-yyy.azuredatabricks.net

resources:
  jobs:
    my-job:
      name: My Workflow Job
      tasks:
        - task_key: main              # each task needs a unique key
          notebook_task:
            notebook_path: ./notebooks/my_notebook.py
          new_cluster:
            spark_version: "13.3.x-scala2.12"
            node_type_id: Standard_DS3_v2
- Switch the deployment environment via a CLI flag or DevOps pipeline variable: databricks bundle deploy -t prod.
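As a hedged extension of the example above, bundle variables can keep environment-specific settings (such as node types) out of the job definition itself; the variable name node_type below is illustrative:
variables:
  node_type:
    description: Cluster node type for the current environment
    default: Standard_DS3_v2

targets:
  dev:
    variables:
      node_type: Standard_DS3_v2
  prod:
    variables:
      node_type: Standard_DS4_v2

# In the job definition, reference the variable instead of a literal value:
#   new_cluster:
#     node_type_id: ${var.node_type}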
Key Tools and Tips
- Use the Databricks CLI in non-interactive mode within pipelines for deployments and validation.
- Parameterize cluster, node, and job settings per environment in the bundle YAML (see the variables example above).
- Test locally with the CLI and validate your bundle syntax before running it through the pipeline (see the commands after this list).
- Store secrets (Azure Databricks tokens, service principal credentials) securely using Azure DevOps secret variables or variable groups.
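For local checks, the CLI's bundle validation command catches syntax and configuration errors before anything reaches the pipeline; for example, assuming you are already authenticated against the Dev workspace and using the job key my-job from the example above:
databricks bundle validate -t dev
databricks bundle deploy -t dev
databricks bundle run my-job -t dev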
This approach offers clear, reproducible promotion of workflow definitions, orchestration, dependencies, and environment settings, fully automated within Azure DevOps and under version control.
For a detailed Microsoft walkthrough with sample files and pipeline YAML, see the official documentation.