
Best place to manage terraform-provider-databricks and databricks cli

ajay_wavicle
Databricks Partner

I am trying to export and import files using terraform-provider-databricks and the Databricks CLI, and I am trying to figure out how to manage the files without running everything locally. What is the best practice for setting up such a migration? Can anyone share an established process for this?

2 REPLIES

Marc_Gibson96
Contributor

Hi Ajay,

This is something you could set up with any CI/CD provider on which you can install and run Databricks CLI commands (e.g., GitHub Actions, Azure DevOps, and so on). Historically, I have used Azure DevOps to drive this between Databricks and the IaC repository, but the pattern should remain the same regardless of which provider you use. Here is a great blog from Microsoft that sums up one approach: Deploy and Manage Azure Databricks Infrastructure using Terraform and Azure DevOps pipeline.

If your use case permits it, I would recommend using Databricks Asset Bundles over the Terraform provider, as this opens up more deployment options and provides the ability to deploy IaC from the Databricks Workspace UI.
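
For illustration, here is a minimal Azure DevOps pipeline sketch of that pattern. The variable group name and the smoke-test path are placeholders, not from a real setup:

# azure-pipelines.yml (sketch; adapt the names to your project)
trigger:
  branches:
    include: [main]

pool:
  vmImage: ubuntu-latest

variables:
  - group: databricks-sp-secrets  # placeholder variable group holding the service principal credentials

steps:
  - script: curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
    displayName: Install Databricks CLI

  - script: databricks workspace list /Shared  # smoke-test that authentication works
    displayName: Verify authentication
    env:
      DATABRICKS_HOST: $(DATABRICKS_HOST)
      DATABRICKS_CLIENT_ID: $(DATABRICKS_CLIENT_ID)
      DATABRICKS_CLIENT_SECRET: $(DATABRICKS_CLIENT_SECRET)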

SteveOstrowski
Databricks Employee

Hi @ajay_wavicle,

There are a couple of well-established patterns for managing Databricks resources with the Terraform provider and CLI without running everything locally. Here is a breakdown of the options and recommended approach.


WHERE TO RUN TERRAFORM AND THE DATABRICKS CLI

The key is to move execution into a CI/CD pipeline so that no one needs to run terraform apply or databricks CLI commands from their laptop. The most common options:

1. GitHub Actions
2. Azure DevOps Pipelines
3. GitLab CI/CD
4. Jenkins

In each case, the pipeline runner installs both the Terraform CLI and the Databricks CLI, authenticates using a service principal, and executes the commands on your behalf.
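
As one concrete illustration of that pattern, here is a minimal GitLab CI sketch; the image choice is arbitrary, and the DATABRICKS_* values are assumed to be configured as masked CI/CD variables:

# .gitlab-ci.yml (sketch; DATABRICKS_HOST/_CLIENT_ID/_CLIENT_SECRET set as masked CI/CD variables)
deploy:
  image: ubuntu:22.04
  before_script:
    - apt-get update && apt-get install -y curl unzip
    - curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
  script:
    - databricks current-user me  # confirms the service principal can authenticate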


AUTHENTICATION FOR AUTOMATED PIPELINES

For CI/CD, use OAuth machine-to-machine (M2M) authentication with a Databricks service principal. This avoids personal access tokens and keeps credentials scoped and rotatable.

You will need:
- A Databricks service principal with the appropriate workspace permissions
- The client ID and client secret stored as pipeline secrets (not in code)
- Environment variables set in the pipeline:

export DATABRICKS_HOST="https://your-workspace.cloud.databricks.com"
export DATABRICKS_CLIENT_ID="<your-sp-client-id>"
export DATABRICKS_CLIENT_SECRET="<your-sp-client-secret>"

The Terraform provider picks these up automatically:

provider "databricks" {
# No hardcoded credentials needed; uses env vars
}

Docs: https://docs.databricks.com/aws/en/dev-tools/auth/index.html


TERRAFORM PROVIDER SETUP

1. Store your .tf files in a Git repository.
2. Use a remote backend for Terraform state (S3 + DynamoDB for AWS, Azure Blob Storage for Azure, or Terraform Cloud/HCP). This way, state is shared across the team and pipeline runs, not on anyone's local machine.
3. Structure your repo with modules for reusable components.

Example project layout:

my-databricks-infra/
  main.tf
  variables.tf
  outputs.tf
  modules/
    workspace-config/
      main.tf
    unity-catalog/
      main.tf
  environments/
    dev.tfvars
    staging.tfvars
    prod.tfvars

A minimal main.tf:

terraform {
  required_providers {
    databricks = {
      source = "databricks/databricks"
    }
  }
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "databricks/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-lock"
  }
}

provider "databricks" {}

Databricks maintains an examples repository with CI/CD patterns for GitHub Actions and Azure DevOps:
https://github.com/databricks/terraform-databricks-examples

Look at the "manual-approve-with-github-actions" and "manual-approve-with-azure-devops" folders for ready-to-use pipeline templates.

Provider registry docs: https://registry.terraform.io/providers/databricks/databricks/latest/docs


DATABRICKS CLI IN CI/CD

Install the CLI in your pipeline with:

curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh

Then use it for file operations, workspace sync, or running bundle commands. The same environment variables (DATABRICKS_HOST, DATABRICKS_CLIENT_ID, DATABRICKS_CLIENT_SECRET) authenticate the CLI automatically.

For file export/import specifically:

# Export a notebook (newer CLI versions take the workspace path plus --file;
# check databricks workspace export --help for your version's exact syntax)
databricks workspace export /Users/someone/notebook.py --file ./local-copy.py

# Import a notebook
databricks workspace import /Users/someone/notebook.py --file ./local-copy.py --language PYTHON

# Sync a local directory to the workspace
databricks sync ./src /Workspace/Users/someone/project --watch=false


RECOMMENDED APPROACH: DATABRICKS ASSET BUNDLES

If your goal is to manage and migrate Databricks resources (jobs, notebooks, pipelines) across environments, consider Databricks Asset Bundles (DABs). They combine the best of both worlds: you define resources as YAML configuration files in Git, and the Databricks CLI handles deployment.

DABs support:
- Multi-environment promotion (dev, staging, prod) through "targets"
- Service principal authentication for CI/CD
- Automatic resource naming and isolation in dev mode
- GitHub Actions integration for automated deployments

A quick example of a databricks.yml:

bundle:
  name: my_project

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://dev-workspace.cloud.databricks.com

  prod:
    mode: production
    workspace:
      host: https://prod-workspace.cloud.databricks.com
    run_as:
      # run_as expects the service principal's application ID
      service_principal_name: "<prod-sp-application-id>"

Deploy with:

databricks bundle deploy --target prod

Docs: https://docs.databricks.com/aws/en/dev-tools/bundles/index.html
Deployment modes: https://docs.databricks.com/aws/en/dev-tools/bundles/deployment-modes.html


WHEN TO USE TERRAFORM VS. ASSET BUNDLES

- Use the Terraform provider for workspace-level infrastructure: creating workspaces, configuring Unity Catalog, managing IAM roles, setting up networking, and provisioning cloud resources.
- Use Databricks Asset Bundles for application-level resources: deploying jobs, notebooks, pipelines, and ML experiments across environments.
- Many teams use both: Terraform for the "platform layer" and DABs for the "application layer." A sketch of chaining the two in one pipeline follows the sample workflow below.


SAMPLE GITHUB ACTIONS WORKFLOW

Here is a simplified GitHub Actions workflow that runs Terraform without any local execution:

name: Deploy Databricks Infra
on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    env:
      DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
      DATABRICKS_CLIENT_ID: ${{ secrets.DATABRICKS_CLIENT_ID }}
      DATABRICKS_CLIENT_SECRET: ${{ secrets.DATABRICKS_CLIENT_SECRET }}

    steps:
      - uses: actions/checkout@v4

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.9.0

      - run: terraform init
      - run: terraform plan -var-file=environments/prod.tfvars
      - run: terraform apply -auto-approve -var-file=environments/prod.tfvars

This keeps everything in version control and running in the cloud, with no local execution required.
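
And if you follow the platform/application split described earlier, you could chain an Asset Bundle deployment onto the same job by appending steps like these to the steps list above. This is a sketch that assumes the bundle's databricks.yml lives in a bundle/ subdirectory of the same repo (that path is hypothetical):

      # Install the Databricks CLI, then deploy the application layer
      - run: curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh

      - run: databricks bundle deploy --target prod
        working-directory: bundle  # hypothetical location of databricks.yml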

Hope this helps you get set up. Let me know if you have follow-up questions about any of these patterns.

* This reply used an agent system I built to research and draft this response based on the wide set of documentation I have available and previous memory. I personally review each draft for obvious issues and to monitor system reliability, and I update it when I detect drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand-new features.