Databricks Community

SashankKotta · ‎08-27-2024

CICD with Databricks Asset Bundles, Workflows and Azure DevOps

In this article you will learn how to set up Databricks Workflows with CI/CD. There are two essential components needed for a complete CI/CD setup of Workflow jobs.

Databricks Asset Bundles (DABs): https://learn.microsoft.com/en-us/azure/databricks/dev-tools/bundles/
Azure DevOps pipeline.

Getting started with Databricks Asset Bundles

We can use Databricks Asset Bundles (DABs) with the Databricks CLI from any terminal to deploy Workflows. Note that DABs require the newer Databricks CLI (v0.205.0 and above); the legacy CLI will not work.
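
Since the version requirement is easy to miss, a quick check like the one below may help. It is only a sketch: the helper name `version_ge` is made up for illustration, it relies on `sort -V` (GNU coreutils), and it assumes `databricks --version` prints a dotted version number somewhere in its output.

```shell
#!/usr/bin/env bash
# Minimum CLI version that supports bundles, as stated in the article.
MIN_VERSION="0.205.0"

# version_ge A B: succeeds when version A >= version B (uses GNU `sort -V`).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

if command -v databricks >/dev/null 2>&1; then
  # Assumption: the version output contains a dotted number such as 0.2xx.y.
  installed="$(databricks --version 2>/dev/null | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n1)"
  if version_ge "$installed" "$MIN_VERSION"; then
    echo "databricks CLI $installed supports bundles"
  else
    echo "databricks CLI '$installed' is too old; need >= $MIN_VERSION" >&2
  fi
else
  echo "databricks CLI not found on PATH" >&2
fi
```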

Installation of databricks-cli in a local terminal:
- For more details, see the installation guide for the Databricks CLI (v0.205 and above): https://learn.microsoft.com/en-us/azure/databricks/dev-tools/cli/install#--curl-installation-for-lin...
- Run the command below to install databricks-cli in the local terminal:
```
curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
```

Authenticate your Databricks CLI to the dev workspace (using a PAT):
- Follow the steps in the link below to authenticate the local terminal to the dev workspace using a personal access token (PAT): https://learn.microsoft.com/en-us/azure/databricks/dev-tools/cli/authentication#--azure-databricks-p...
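
For orientation, PAT authentication ends up as a profile with a `host` and a `token` in a `.databrickscfg` file. The sketch below writes such a profile; the helper name `write_pat_profile` and the placeholder values are hypothetical. Note that it overwrites the target file, so it is not suitable if you already keep other profiles there.

```shell
# write_pat_profile FILE HOST TOKEN
# Writes a [DEFAULT] profile for PAT authentication, readable by the owner only.
# Overwrites FILE if it already exists.
write_pat_profile() {
  cfg_file="$1"
  host="$2"
  token="$3"
  (
    # Restrict permissions on newly created files inside this subshell.
    umask 077
    printf '[DEFAULT]\nhost = %s\ntoken = %s\n' "$host" "$token" > "$cfg_file"
  )
  # Also tighten permissions in case the file existed before.
  chmod 600 "$cfg_file"
}

# Example (placeholders, not real values):
#   write_pat_profile "$HOME/.databrickscfg" "<workspace_host_url>" "<personal_access_token>"
#   databricks current-user me   # should print the authenticated user if the profile works
```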

Authenticate your Databricks CLI to the dev workspace (using a service principal):
- Follow the steps in the link below to authenticate the local terminal to the dev workspace using a service principal: https://learn.microsoft.com/en-us/azure/databricks/dev-tools/cli/authentication#--microsoft-entra-id...

From your local terminal, run the following command:

databricks bundle init

When prompted, select Python project and provide a project name (e.g., demo_wf). After you complete the prompts, a folder named after the project is generated, containing all the components needed for a Workflow.

Next, navigate to the project directory.

cd demo_wf

We need the notebooks in .ipynb format inside the src folder. These notebook files will be the respective tasks in the Workflow. We can also create DLT pipelines and libraries as individual tasks.

Inside the resources folder, we will have a YAML file called <project_name>_job.yml (in our case, demo_wf_job.yml). This file defines the task flow:

tasks:
  - task_key: task1
    job_cluster_key: job_cluster
    notebook_task:
      notebook_path: ../src/notebook_1.ipynb
  - task_key: task2
    job_cluster_key: job_cluster
    notebook_task:
      notebook_path: ../src/notebook_2.ipynb
    depends_on:
      - task_key: task1
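
The `notebook_path` values above are relative to the YAML file itself, which is why they start with `../src/`. A small pre-flight check such as the following can catch a mistyped path before deployment. It is a rough sketch: `check_notebook_paths` is a made-up helper, and it assumes each `notebook_path` sits on its own line with an unquoted value, as in the file above.

```shell
# check_notebook_paths JOB_YML
# Verifies that every notebook_path in JOB_YML points at an existing file,
# resolving each path relative to the directory that contains JOB_YML.
check_notebook_paths() {
  job_yml="$1"
  base_dir="$(dirname "$job_yml")"
  missing=0
  while IFS= read -r rel; do
    # Skip the empty line produced when no notebook_path is found.
    [ -z "$rel" ] && continue
    if [ ! -f "$base_dir/$rel" ]; then
      echo "missing notebook: $rel" >&2
      missing=1
    fi
  done <<EOF
$(sed -n 's/^[[:space:]]*notebook_path:[[:space:]]*//p' "$job_yml")
EOF
  return "$missing"
}

# Example, run from the project directory:
#   check_notebook_paths resources/demo_wf_job.yml && echo "all notebooks found"
```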

After navigating to the project directory (demo_wf), run the following command to catch any syntax errors prior to deployment.

databricks bundle validate

Finally, run the command to deploy the Workflow in development mode.

databricks bundle deploy -t dev
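
If you run these two commands often, they can be combined so that a failed validation stops the deployment. The wrapper below is one possible sketch; the function name `deploy_bundle` is an assumption, and it defaults to the dev target used in this article.

```shell
# deploy_bundle [TARGET]
# Validates the bundle for TARGET (default: dev) and deploys only if validation passes.
deploy_bundle() {
  target="${1:-dev}"
  if ! databricks bundle validate -t "$target"; then
    echo "validation failed for target '$target'; skipping deploy" >&2
    return 1
  fi
  databricks bundle deploy -t "$target"
}

# Example, run from the project directory (demo_wf):
#   deploy_bundle dev
```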

At this point, you have a sample project with a Workflow deployed to your Databricks Workspace. The same commands can be run from a build pipeline in Azure DevOps, which completes the CI/CD setup.

Using DABs in Azure DevOps pipelines:

To begin, we need an Azure virtual machine to act as the agent that runs commands for our DevOps pipeline. Create a virtual machine in Azure, assign a network security group, and add an inbound rule allowing SSH (port 22) from your IP address so you can connect and set up the machine. While creating the VM, you will be asked to download a .pem file; keep it safe, as it is needed to connect to the VM over SSH.

The next step is to install the databricks-cli on this VM and configure the machine as an agent for your Azure agent pool. If you have set up the inbound networking rules correctly, you can connect to the VM using the command:

ssh -i <path_to_pem>/<file_name>.pem <username>@<hostname>
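
One common stumbling block: ssh refuses a private key whose file permissions are too open. Restricting the downloaded .pem file to owner read-only before connecting usually avoids that. A minimal sketch, with `lock_down_key` as a made-up helper name:

```shell
# lock_down_key FILE
# Makes the private key readable by its owner only, as ssh expects.
lock_down_key() {
  chmod 400 "$1"
}

# Example (same placeholders as in the ssh command above):
#   lock_down_key <path_to_pem>/<file_name>.pem
#   ssh -i <path_to_pem>/<file_name>.pem <username>@<hostname>
```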

Now install the databricks-cli:

curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sudo sh

If you get any errors concerning unzip, then please install unzip using the below commands and re-run the above curl command:

sudo apt update -y
sudo apt install unzip -y

Next, configure the VM to run as an agent in the Azure agent pool by following the steps below.

Create a Self-hosted Agent in Azure DevOps: To create a self-hosted agent, go to Project Settings (Bottom left) and select the Agent pools option under the Pipelines section. Press the Add pool button and configure the agent:

On the top-right corner, press the New Agent button. You can create Windows, macOS, and Linux agents. Select the OS that matches your VM, then follow the instructions; in our case, we used an Ubuntu image, so we need a Linux agent.
- Connect to the VM using SSH.
- Download the agent, then extract it to a folder using the Linux commands below in the VM terminal.
```
# Create a directory named myagent
mkdir myagent

# Navigate into that directory
cd myagent

# Download the Linux agent archive from the link given in the instructions
wget https://vstsagentpackage.azureedge.net/agent/3.236.1/vsts-agent-linux-x64-3.236.1.tar.gz

# Extract the agent from the downloaded archive (wget saved it in the current directory)
tar zxvf vsts-agent-linux-x64-3.236.1.tar.gz
```
Configure the agent by running the config script in the VM terminal:
```
./config.sh
```
- Fill in the following prompts:
  - Server URL: Copy and paste the organization URL, which looks like the following: https://dev.azure.com/<my-organization-name>
  - Personal Access Token (PAT): Go to the Personal Access Tokens option under the User Settings icon. Ensure you generate a PAT with Read & manage access to the Agent pools.
  - Agent pool name: The newly created pool, which is the my-demo-pool in our case
  - Agent Name: Give a meaningful name or stay with the default
  - Work folder: Press enter for the default
  - Agent as Service: Press enter to use the default.
Run the agent by executing the run script:
```
./run.sh
```
- Once done, you can see that the Agent is up and running under the Agents panel. The self-hosted agent is connected to Azure DevOps and listens for new jobs.
- For more details, refer to the link: Self-hosted agent
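
Running `./run.sh` keeps the agent online only while that terminal session stays open. For a VM that should keep serving pipeline jobs, the agent can instead be installed as a service with the `svc.sh` script that `config.sh` generates in the agent folder. The wrapper below is a sketch under that assumption; `start_agent_service` is a made-up name.

```shell
# start_agent_service AGENT_DIR
# Installs and starts the agent as a service using the generated svc.sh.
start_agent_service() {
  agent_dir="$1"
  if [ ! -x "$agent_dir/svc.sh" ]; then
    echo "svc.sh not found in $agent_dir; run ./config.sh first" >&2
    return 1
  fi
  (
    # Run from the agent folder, in a subshell so the caller's directory is unchanged.
    cd "$agent_dir" || exit 1
    sudo ./svc.sh install && sudo ./svc.sh start && sudo ./svc.sh status
  )
}

# Example:
#   start_agent_service ~/myagent
```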
Now that the agent VM is configured, the next step is to create azure-pipelines.yml (the DevOps pipeline definition). The YAML should look like the one below.

# Starter pipeline
# Start with a minimal pipeline that you can customize to build and deploy your code.
# Add steps that build, run tests, deploy, and more:
# https://aka.ms/yaml
trigger:
- main
pool: my-demo-pool
steps:
- script: echo "Hello, world!"
  displayName: 'Run a one-line script'
- script: |
    echo Add other tasks to build, test, and deploy your project.
    echo See https://aka.ms/yaml
  displayName: 'Run a multi-line script'
- task: Bash@3
  inputs:
    targetType: 'inline'
    script: |
      # Write your commands here
      echo 'Hello world'

      # Write the service principal profile for the Databricks CLI
      echo "[DEFAULT]" > ~/.databrickscfg
      echo "host = <workspace_host_url>" >> ~/.databrickscfg
      echo "azure_workspace_resource_id = <Azure_sp_resource_id>" >> ~/.databrickscfg
      echo "azure_tenant_id = <tenant_id>" >> ~/.databrickscfg
      echo "azure_client_id = <spn_client_id>" >> ~/.databrickscfg
      echo "azure_client_secret = <client_secret>" >> ~/.databrickscfg
      # Avoid printing ~/.databrickscfg here: it would write the client secret to the pipeline log

      databricks bundle validate
      databricks bundle deploy -t dev
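
As an alternative to writing the client secret into the pipeline file, the Databricks CLI can also read service principal credentials from environment variables (`DATABRICKS_HOST`, `ARM_TENANT_ID`, `ARM_CLIENT_ID`, `ARM_CLIENT_SECRET`). The sketch below assumes those values are stored as pipeline variables and mapped into the step's environment, which Azure Pipelines requires explicitly for secret variables. The `require_env` helper is hypothetical; it fails early when a variable is missing.

```shell
# require_env NAME...
# Fails with a message if any of the named environment variables is unset or empty.
require_env() {
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      return 1
    fi
  done
}

# Example for the inline script of the Bash task (assumes the variables are mapped via env:):
#   require_env DATABRICKS_HOST ARM_TENANT_ID ARM_CLIENT_ID ARM_CLIENT_SECRET || exit 1
#   databricks bundle validate
#   databricks bundle deploy -t dev
```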

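To ensure our CI/CD is working as expected, the VM agent should be up and running under the Agents panel (Project settings > Agent pools > Agents tab). The main branch of the Azure DevOps repository should contain the bundle project files (databricks.yml plus the resources and src folders) alongside azure-pipelines.yml, so that the bundle commands run from the bundle root.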

Any changes to the Azure DevOps main branch should be deployed/reflected in the Workflow jobs of your Databricks Workspace.

Conclusion:

In conclusion, setting up Databricks Workflows with CI/CD involves two key components: Databricks Asset Bundles (DABs) and an Azure DevOps pipeline. By using DABs with the Databricks CLI, you can easily deploy workflows from any terminal. Integrating this setup with Azure DevOps requires configuring a virtual machine as an agent, installing necessary tools, and creating a pipeline to automate deployments, ensuring seamless updates to your Databricks Workflows.
