In today's fast-paced development environments, DevOps or platform teams need to ensure consistency, scalability, and efficiency in managing data pipeline deployments. Infrastructure-as-Code (IaC) tools allow teams to automate the provisioning and management of resources, reducing manual errors and enhancing collaboration across teams.
Within the Databricks Data Intelligence Platform, Terraform is the predominant tool for automating the creation, maintenance, and evolution of cloud infrastructure, but its steep learning curve can be an obstacle for data practitioners (data scientists, data engineers, ML engineers) who also want to automate and scale their workloads. That was the motivation for Databricks to introduce Databricks Asset Bundles, which are increasingly being adopted by Databricks customers to build CI/CD pipelines and deploy project-specific resources. Databricks Asset Bundles give users a standard, holistic way to deploy projects on Databricks. They provide user isolation during development, support environment overrides for different resources, and take care of the various API calls needed to publish objects such as code, cluster configurations, and jobs. This greatly reduces the effort required to deploy and manage a Databricks project in production.
This blog introduces five practical tips for data practitioners who are using Databricks Asset Bundles:
1. Run the validate command early and often.
2. Generate bundles from existing jobs instead of rewriting them by hand.
3. Build custom bundle templates for repeatable project setups.
4. Choose deliberately where to store your Python wheels.
5. Keep bundle variables simple.
To develop a bundle, we go through a few steps, from preparing workload artifacts such as notebooks to authoring a bundle configuration file. Misspellings and syntax errors can easily creep in along the way. To get quick feedback on whether the bundle is set up properly, run the validate command, which checks that the bundle's configuration is syntactically correct. We recommend running this command regularly before deployment to catch potential errors early. If validation passes without issue, you can proceed to the deploy command with confidence.
But there is more to the validate command than syntax checks. It also produces a comprehensive, parameterized view of the resources to be deployed, much like terraform plan, offering a clear picture of what will be instantiated. We can customize the output by specifying various options, as shown below:
databricks bundle validate -t dev -o json
This command generates the resource definitions targeting the development environment. Optionally, you can export the output as a JSON file, which can then be shared with support teams for debugging purposes:
databricks bundle validate -t dev -o json > dab_dev.json
It's important to note that, while Databricks Asset Bundles are built on Terraform, the output does not indicate which changes will be applied compared to the previous deployment (unlike terraform plan). To track and review changes effectively, especially those related to artifacts such as pipeline source code, we highly recommend using a version control system like GitHub. This practice improves transparency and helps teams stay informed about updates.
Databricks offers the flexibility to develop code and schedule jobs through its intuitive web portal or popular IDE plugins. However, for a more scalable and efficient workflow, leveraging Databricks Asset Bundles for managing job deployments is a more robust solution.
Migrating an existing job or pipeline to Databricks Asset Bundles can initially seem overwhelming, especially with the complexity of jobs created in the workspace – ranging from various task types to intricate code dependencies. Although you can easily copy and paste job definitions from the YAML view in the UI into a bundle, it's important to adjust the code paths to the relative paths used by the bundles.
Fortunately, the creators of Asset Bundles have streamlined this process. Instead of manually copying job definitions into your bundle folder, you can use a built-in command to automatically generate bundles from existing jobs in your workspace. By simply running the following command in a folder where you’ve already set up a Databricks Asset Bundle, the job definition and related code will be transferred directly into your bundle folder.
databricks bundle generate job --existing-job-id job_id
Locating the job ID is straightforward – you can either find it in the Workflows UI or retrieve it using the Databricks CLI. Once you have the job ID, running the command will generate a corresponding YAML file that contains the workflow definition, along with the job's associated code. The command also ensures that all code references are updated to use the correct paths. At this point, your workflow is ready to be seamlessly deployed using Databricks Asset Bundles.
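For example, a minimal sketch from the command line (the job ID shown below is purely illustrative):
# list jobs with their IDs, then generate a bundle configuration for the chosen one
databricks jobs list
databricks bundle generate job --existing-job-id 123456789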
Bundle templates provide a consistent and repeatable method for creating bundles by defining folder structures, build steps, tasks, tests, and so on. They specify the directory structure of the bundle being created, streamlining development workflows. As of October 2024, Databricks offers four default project templates to help users get started quickly. However, these default templates may not always meet the diverse needs of different customer use cases.
You can create custom bundle templates tailored to your organization’s needs. These templates help establish organizational standards for new projects, including default permissions, service principals, and CI/CD configurations. Furthermore, you can integrate your bundle templates with CI/CD workflows, streamlining both development and deployment processes.
For instance, a customer in the pharmaceutical sector conducting clinical trials on Databricks needs the data science team to create a new project for each study. To facilitate their CI/CD workflow, they use GitHub Actions. By creating a bundle template specifically for studies – where the study name is a parameter – they can easily organize each project. Using this tutorial as a guide, their databricks_template_schema.json would include a parameter like study_name, as shown below:
{
  "properties": {
    "study_name": {
      "type": "string",
      "description": "The study name of the clinical trial",
      "order": 1
    }
  }
}
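Inside the template folder, files reference this parameter using Go template syntax, which is substituted when the bundle is initialized. As a rough sketch, the databricks.yml.tmpl could start with something like the following (the target configuration shown is illustrative):
bundle:
  name: {{.study_name}}

include:
  - resources/*.yml

targets:
  dev:
    mode: development
    default: true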
Their template structure might include a bundle definition template file (databricks.yml.tmpl), resource definitions in the resources folder, source code in the src folder, and test scripts in the tests folder.
.
├── databricks_template_schema.json
└── template
    ├── databricks.yml.tmpl
    ├── resources
    │   └── {{.study_name}}_job.yml.tmpl
    ├── src
    │   └── {{.study_name}}_task.py
    └── tests
        ├── integration
        │   └── {{.study_name}}_integration_test.py
        └── unit
            └── {{.study_name}}_transform_logic_test.py
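Once the template is published, for example in a shared Git repository, a new study bundle can be initialized from it. The repository URL and output directory below are illustrative; the CLI prompts for the study_name parameter declared in databricks_template_schema.json:
databricks bundle init https://github.com/<org>/study-bundle-template --output-dir ./oncology_phase3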
With custom templates, they can effortlessly start a new project by initializing a bundle from the shared template. Additionally, the team can leverage a parameterized GitHub Actions workflow. For instance, the workflow can take study_name as an input (declared here under workflow_dispatch) and use it to execute the appropriate notebooks for unit testing:
name: databricks-cicd
on:
  workflow_dispatch:
    inputs:
      study_name:
        description: 'The study name of the trial'
        required: true
...
jobs:
  ...
  unit-test:
    runs-on: ubuntu-latest
    steps:
      ...
      - name: run unit test
        uses: databricks/run-notebook@v0
        with:
          local-notebook-path: tests/unit/${{ inputs.study_name }}_transform_logic_test.py
          git-commit: ${{ github.event.pull_request.head.sha || github.sha }}
...
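Beyond unit testing, the same workflow can deploy the bundle itself using the Databricks CLI. Below is a minimal sketch of a deploy job, assuming the databricks/setup-cli action and workspace credentials stored as repository secrets:
  deploy:
    runs-on: ubuntu-latest
    needs: unit-test
    steps:
      - uses: actions/checkout@v4
      # install the Databricks CLI on the runner
      - uses: databricks/setup-cli@main
      - name: deploy bundle
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
        run: databricks bundle deploy -t prod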
In some cases, users need to build Python wheels to encapsulate specific business or functional logic for reuse across multiple stages or environments. A common question is where these wheels should be stored. Generally, there are three options, each with its own pros, cons, and authentication methods. Here we rank them according to ease of use.
1. Artifacts store
As the first and perhaps most straightforward option, you can push wheels to a public repository like PyPI or a private package repository. This approach ensures that wheels can be installed from anywhere with an internet connection, or via a private network if you host the repository yourself. This option maximizes reusability, allowing the wheels to be installed not only in Databricks but also in other environments. These repositories also natively support versioning, making it easy to upgrade or downgrade to different versions.
Credentials such as key pairs or tokens need to be securely managed for authentication. The best practice in Databricks is to store these credentials using the platform's native secrets management, which integrates seamlessly with the platform.
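Within a bundle, a job task can then install the wheel straight from the package index. The snippet below is a sketch of the libraries section of a job task definition; the package name, version, and private repository URL are illustrative:
      libraries:
        - pypi:
            package: my_transform_lib==1.2.0
            repo: https://pypi.example.com/simple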
2. Unity Catalog Volumes
As the second option, Unity Catalog Volumes provide governance over non-tabular datasets and can also store Python wheels. Like Workspace files, volumes offer convenient paths (formatted as /Volumes/<catalog>/<schema>/<volume>/<folder>) and leverage the unified permission model to enable secure access.
The main advantage of using Unity Catalog Volumes over Workspace files is that wheels can be uploaded once and shared across multiple workspaces (under the same metastore) via catalog-workspace binding. However, Unity Catalog Volumes, being logical layers over cloud storage, do not support object versioning. As with Workspace files, you’ll need to rely on version control to manage wheel versions.
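A bundle's job task can reference such a wheel directly by its volume path; the catalog, schema, volume, and file name below are illustrative:
      libraries:
        - whl: /Volumes/shared_libs/packages/wheels/my_transform_lib-1.2.0-py3-none-any.whl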
3. Workspace files
For project-specific implementations that are unlikely to be reused across teams, keeping Python wheels as Databricks Workspace files can be a practical solution. These wheels can either be stored within the project that builds them or in another Workspace location outside the project tree. Two key benefits of using Workspace files are that no separate storage or package repository needs to be provisioned, and that bundles can upload the wheels to the workspace as part of the same deployment.
However, these wheels will only be accessible within the workspace where they are uploaded. To make them available in other workspaces, you’ll need to use a CI/CD pipeline and Databricks API to publish them to additional workspaces. Additionally, since Workspace files don’t support versioning, you’ll need to use version control to manage updates or remove old versions.
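For this option, Databricks Asset Bundles can build the wheel and upload it to the workspace file tree during deployment. The following sketch shows the relevant pieces, with illustrative names and paths: an artifacts section in databricks.yml and a workspace-relative reference in the job task definition:
# in databricks.yml
artifacts:
  my_wheel:
    type: whl
    build: python -m build --wheel
    path: .

# in the job task definition under resources/
      libraries:
        - whl: ../dist/*.whl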
Bundle variables are a powerful tool for customizing your bundle deployments to each project's needs. However, they require careful consideration. Over time, adding too many variables to meet project requirements can complicate the deployment process. In short: keep it simple! Striking the right balance between flexibility and simplicity is key. Here are some important tips to keep in mind:
Avoid side effects: While variables add flexibility, be cautious about introducing side effects – especially through boolean variables that change control flows in your pipelines. Over time, these can make deployments unnecessarily complex. If you find yourself adding too many conditional variables, it might be time to consider splitting your project into multiple bundles and modularizing parts for reuse.
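As a point of comparison, a small, well-scoped set of variables with per-target overrides might look like the following sketch in databricks.yml (names and values are illustrative); resources can then reference them as ${var.catalog}, keeping per-environment differences in one place:
variables:
  catalog:
    description: Target catalog for the pipeline
    default: dev_catalog

targets:
  prod:
    variables:
      catalog: prod_catalog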