In today's fast-paced development environments, DevOps or platform teams need to ensure consistency, scalability, and efficiency in managing data pipeline deployments. Infrastructure-as-Code (IaC) tools allow teams to automate the provisioning and management of resources, reducing manual errors and enhancing collaboration across teams.
Within the Databricks Data Intelligence Platform, Terraform is the predominant tool for automating the creation, maintenance, and evolution of cloud infrastructure, but its steep learning curve can be an obstacle for data practitioners (data scientists, data engineers, ML engineers) who also want to automate and scale their workloads. That was the motivation for Databricks to introduce Databricks Asset Bundles, which are increasingly being adopted by Databricks customers to build CI/CD pipelines and deploy project-specific resources. Databricks Asset Bundles give users a standard, holistic way to deploy projects on Databricks. They provide user isolation during development, support environment overrides for different resources, and take care of the various API calls needed to publish objects such as code, cluster configurations, and jobs. This greatly reduces the effort required to deploy and manage a Databricks project in production.
This blog introduces five practical tips for data practitioners who are using Databricks Asset Bundles:
1. Run the validate command early and often.
2. Generate bundles from existing jobs instead of rewriting them by hand.
3. Build custom bundle templates for repeatable project setups.
4. Choose deliberately where to store your Python wheels.
5. Keep bundle variables simple.
To develop a bundle, we go through a few steps, from preparing workload artifacts such as notebooks to authoring a bundle configuration file. Misspellings and syntax errors can easily creep in along the way. To get quick feedback on whether the bundle is set up properly, run the validate command, which checks that the bundle's configuration is syntactically correct. We recommend running this command regularly before deployment to catch potential errors early. If validation passes without issue, you can proceed to the deploy command with confidence.
But there is more to the validate command than syntax checks. It also produces a comprehensive, parameterized view of the resources to be deployed, much like terraform plan, offering a clear picture of what will be instantiated. We can customize the output by specifying various options, as shown below:
databricks bundle validate -t dev -o json
This command generates the resource definitions targeting the development environment. Optionally, you can export the output as a JSON file, which can then be shared with support teams for debugging purposes:
databricks bundle validate -t dev -o json > dab_dev.json
It's important to note that, while Databricks Asset Bundles are built on Terraform, the output does not indicate which changes will be applied compared to the previous deployment (unlike terraform plan). To track and review changes effectively, especially those related to artifacts such as pipeline source code, we highly recommend using a version control system like GitHub. This practice improves transparency and helps teams stay informed about updates.
Databricks offers the flexibility to develop code and schedule jobs through its intuitive web portal or popular IDE plugins. However, for a more scalable and efficient workflow, leveraging Databricks Asset Bundles for managing job deployments is a more robust solution.
Migrating an existing job or pipeline to Databricks Asset Bundles can initially seem overwhelming, especially with the complexity of jobs created in the workspace – ranging from various task types to intricate code dependencies. Although you can easily copy and paste job definitions from the YAML view in the UI into a bundle, it's important to adjust the code paths to the relative paths used by the bundles.
Fortunately, the creators of Asset Bundles have streamlined this process. Instead of manually copying job definitions into your bundle folder, you can use a built-in command to automatically generate bundles from existing jobs in your workspace. By simply running the following command in a folder where you’ve already set up a Databricks Asset Bundle, the job definition and related code will be transferred directly into your bundle folder.
databricks bundle generate job --existing-job-id job_id
Locating the job ID is straightforward – you can either find it in the Workflows UI or retrieve it using the Databricks CLI. Once you have the job ID, running the command will generate a corresponding YAML file that contains the workflow definition, along with the job's associated code. The command also ensures that all code references are updated to use the correct paths. At this point, your workflow is ready to be seamlessly deployed using Databricks Asset Bundles.
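For example, a minimal sketch from the command line (the job ID shown below is purely illustrative):
# list jobs with their IDs, then generate a bundle configuration for the chosen one
databricks jobs list
databricks bundle generate job --existing-job-id 123456789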
Bundle templates provide a consistent and repeatable method for creating bundles by defining folder structures, build steps, tasks, tests, and so on. They specify the directory structure of the bundle being created, streamlining development workflows. As of October 2024, Databricks offers four default project templates to help users get started quickly. However, these default templates may not always meet the diverse needs of different customer use cases.
You can create custom bundle templates tailored to your organization’s needs. These templates help establish organizational standards for new projects, including default permissions, service principals, and CI/CD configurations. Furthermore, you can integrate your bundle templates with CI/CD workflows, streamlining both development and deployment processes.
For instance, a customer in the pharmaceutical sector conducting clinical trials on Databricks needs the data science team to create a new project for each study. To facilitate their CI/CD workflow, they use GitHub Actions. By creating a bundle template specifically for studies – where the study name is a parameter – they can easily organize each project. Using this tutorial as a guide, their databricks_template_schema.json would include a parameter like study_name, as shown below:
{
  "properties": {
    "study_name": {
      "type": "string",
      "description": "The study name of the clinical trial",
      "order": 1
    }
  }
}
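Inside the template folder, files reference this parameter using Go template syntax, which is substituted when the bundle is initialized. As a rough sketch, the databricks.yml.tmpl could start with something like the following (the target configuration shown is illustrative):
bundle:
  name: {{.study_name}}

include:
  - resources/*.yml

targets:
  dev:
    mode: development
    default: true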
Their template structure might include a bundle definition template file (databricks.yml.tmpl), resource definitions in the resources folder, source code in the src folder, and test scripts in the tests folder.
.
├── databricks_template_schema.json
└── template
    ├── databricks.yml.tmpl
    ├── resources
    │   └── {{.study_name}}_job.yml.tmpl
    ├── src
    │   └── {{.study_name}}_task.py
    └── tests
        ├── integration
        │   └── {{.study_name}}_integration_test.py
        └── unit
            └── {{.study_name}}_transform_logic_test.py
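Once the template is published, for example in a shared Git repository, a new study bundle can be initialized from it. The repository URL and output directory below are illustrative; the CLI prompts for the study_name parameter declared in databricks_template_schema.json:
databricks bundle init https://github.com/<org>/study-bundle-template --output-dir ./oncology_phase3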
With custom templates, they can effortlessly start a new project by initializing a bundle from the shared template. Additionally, the team can leverage a parameterized GitHub Actions workflow. For instance, the workflow can take study_name as an input (declared here under workflow_dispatch) and use it to execute the appropriate notebooks for unit testing:
name: databricks-cicd
on:
  workflow_dispatch:
    inputs:
      study_name:
        description: 'The study name of the trial'
        required: true
...
jobs:
  ...
  unit-test:
    runs-on: ubuntu-latest
    steps:
      ...
      - name: run unit test
        uses: databricks/run-notebook@v0
        with:
          local-notebook-path: tests/unit/${{ inputs.study_name }}_transform_logic_test.py
          git-commit: ${{ github.event.pull_request.head.sha || github.sha }}
...
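Beyond unit testing, the same workflow can deploy the bundle itself using the Databricks CLI. Below is a minimal sketch of a deploy job, assuming the databricks/setup-cli action and workspace credentials stored as repository secrets:
  deploy:
    runs-on: ubuntu-latest
    needs: unit-test
    steps:
      - uses: actions/checkout@v4
      # install the Databricks CLI on the runner
      - uses: databricks/setup-cli@main
      - name: deploy bundle
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
        run: databricks bundle deploy -t prod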
In some cases, users need to build Python wheels to encapsulate specific business or functional logic for reuse across multiple stages or environments. A common question is where these wheels should be stored. Generally, there are three options, each with its own pros, cons, and authentication methods. Here we rank them according to ease of use.
1. Artifacts store
As the first and perhaps most straightforward option, you can push wheels to a public repository like PyPI or a private package repository. This approach ensures that wheels can be installed from anywhere with an internet connection, or via a private network if you host the repository yourself. This option maximizes reusability, allowing the wheels to be installed not only in Databricks but also in other environments. These repositories also natively support versioning, making it easy to upgrade or downgrade to different versions.
Credentials such as key pairs or tokens need to be securely managed for authentication. The best practice in Databricks is to store these credentials using the platform's native secrets management, which integrates seamlessly with the platform.
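Within a bundle, a job task can then install the wheel straight from the package index. The snippet below is a sketch of the libraries section of a job task definition; the package name, version, and private repository URL are illustrative:
      libraries:
        - pypi:
            package: my_transform_lib==1.2.0
            repo: https://pypi.example.com/simple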
2. Unity Catalog Volumes
As the second option, Unity Catalog Volumes provide governance over non-tabular datasets and can also store Python wheels. Like Workspace files, volumes offer convenient paths (formatted as /Volumes/<catalog>/<schema>/<volume>/<folder>) and leverage the unified permission model to enable secure access.
The main advantage of using Unity Catalog Volumes over Workspace files is that wheels can be uploaded once and shared across multiple workspaces (under the same metastore) via catalog-workspace binding. However, Unity Catalog Volumes, being logical layers over cloud storage, do not support object versioning. As with Workspace files, you’ll need to rely on version control to manage wheel versions.
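A bundle's job task can reference such a wheel directly by its volume path; the catalog, schema, volume, and file name below are illustrative:
      libraries:
        - whl: /Volumes/shared_libs/packages/wheels/my_transform_lib-1.2.0-py3-none-any.whl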
3. Workspace files
For project-specific implementations that are unlikely to be reused across teams, keeping Python wheels as Databricks Workspace files can be a practical solution. These wheels can either be stored within the project that builds them or in another Workspace location outside the project tree. Two key benefits of using Workspace files are that no separate storage or package repository needs to be provisioned, and that bundles can upload the wheels to the workspace as part of the same deployment.
However, these wheels will only be accessible within the workspace where they are uploaded. To make them available in other workspaces, you’ll need to use a CI/CD pipeline and Databricks API to publish them to additional workspaces. Additionally, since Workspace files don’t support versioning, you’ll need to use version control to manage updates or remove old versions.
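For this option, Databricks Asset Bundles can build the wheel and upload it to the workspace file tree during deployment. The following sketch shows the relevant pieces, with illustrative names and paths: an artifacts section in databricks.yml and a workspace-relative reference in the job task definition:
# in databricks.yml
artifacts:
  my_wheel:
    type: whl
    build: python -m build --wheel
    path: .

# in the job task definition under resources/
      libraries:
        - whl: ../dist/*.whl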
Bundle variables are a powerful tool for customizing your bundle deployments to each project's needs. However, they require careful consideration. Over time, adding too many variables to meet project requirements can complicate the deployment process. In short: keep it simple! Striking the right balance between flexibility and simplicity is key. Here are some important tips to keep in mind:
Avoid side effects: While variables add flexibility, be cautious about introducing side effects – especially through boolean variables that change control flows in your pipelines. Over time, these can make deployments unnecessarily complex. If you find yourself adding too many conditional variables, it might be time to consider splitting your project into multiple bundles and modularizing parts for reuse.
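As a point of comparison, a small, well-scoped set of variables with per-target overrides might look like the following sketch in databricks.yml (names and values are illustrative); resources can then reference them as ${var.catalog}, keeping per-environment differences in one place:
variables:
  catalog:
    description: Target catalog for the pipeline
    default: dev_catalog

targets:
  prod:
    variables:
      catalog: prod_catalog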