Vladsiv
Databricks Employee

 

Introduction

Databricks Asset Bundles (DAB) is a structured way to define, deploy, and manage Databricks workflows, including jobs, clusters, dashboards, model serving endpoints, and other resources, using declarative YAML configurations. By treating infrastructure and data assets as code, DAB lets us apply software engineering best practices such as version control, CI/CD integration, and automation. It also simplifies collaboration and deployment across environments, making Databricks projects easier to manage and scale.

If you are completely new to DAB, please refer to What are Databricks Asset Bundles before continuing with this blog post.

By using DAB, you can define Databricks resources like jobs, pipelines, and notebooks as source files. These files fully describe a project and the code that governs it, providing project structure and automation for testing and deployment. Additionally, DAB allows us to define deployment targets that can be fully customized depending on the use case and the project’s needs, such as development, staging, and production.

In some cases, specific environments may require additional resources. For example, staging might include extra testing pipelines that are not needed in production once validation is complete.

In this blog post, we will explore different ways of customizing resource deployments per target.

 

Using include and resources mapping

As we know, Databricks resources are defined by specifying the type of resource and its configuration under the resources mapping in databricks.yml. Resources can include Databricks apps, clusters, dashboards, jobs, pipelines, model serving endpoints, and more. The following example illustrates how to define a simple job and a job cluster:

resources:
  jobs:
    hello-job:
      name: hello-job
      max_concurrent_runs: 1
      job_clusters:
        - job_cluster_key: small_cluster
          new_cluster:
            spark_version: 16.4.x-scala2.12
            node_type_id: m5d.large
            num_workers: 1
      tasks:
        - task_key: hello-task
          job_cluster_key: small_cluster
          notebook_task:
            notebook_path: ./hello.py
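
Once a resource like this is defined, the bundle can be validated and deployed with the Databricks CLI. The following commands are a minimal sketch; they assume the CLI is configured and that a dev target exists, as shown in the next example:

# Validate the bundle configuration
databricks bundle validate

# Deploy the bundle to the dev target
databricks bundle deploy -t dev

# Run the job defined in the bundle
databricks bundle run hello-job -t dev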

By default, resources can be declared at the top level, making them available across all deployment targets:

# databricks.yml

bundle:
  name: test-bundle

resources:
  ...

targets:
  dev:
    default: true
    ...
  stg:
    ...
  prd:
    ...

With this setup, the same resources are deployed to dev, stg, and prd, ensuring consistency across all environments.
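
As an illustration, here is a minimal sketch of what those targets might look like; the workspace hosts are placeholders and the mode settings are just one common convention:

# databricks.yml (illustrative targets section)

targets:
  dev:
    default: true
    mode: development
    workspace:
      host: https://dev-workspace.cloud.databricks.com
  stg:
    workspace:
      host: https://stg-workspace.cloud.databricks.com
  prd:
    mode: production
    workspace:
      host: https://prd-workspace.cloud.databricks.com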

In many cases, different environments require unique configurations. For example, you might need additional testing pipelines in staging but not in production. To achieve this, you can define resources specific to each deployment target:

# databricks.yml

bundle:
  name: test-bundle

resources:
  ...

targets:
  dev:
    default: true
    resources:
      ...
    ...
  stg:
    resources:
      ...
    ...
  prd:
    resources:
      ...
    ...

By defining resources at the target level, additional resources are deployed only where needed, while still inheriting the global resources from the top level. This flexible approach ensures that each environment is optimized for its purpose without unnecessary configurations.

Be aware that each resource has an identifier. If you use the same identifier at both the top level and the target level, the two definitions are merged, and any settings defined at the target level take precedence over the corresponding top-level settings.
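
As a hedged sketch, assume the hello-job from the first example is declared at the top level. The prd target can then redefine just the cluster size; the definitions are joined on the job_cluster_key, and the remaining job settings are inherited from the top level (the node type and worker count below are illustrative):

# databricks.yml (illustrative override of hello-job for prd)

targets:
  prd:
    resources:
      jobs:
        hello-job:
          job_clusters:
            - job_cluster_key: small_cluster
              new_cluster:
                node_type_id: m5d.2xlarge
                num_workers: 4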

For detailed guidance on customizing configurations for specific targets, refer to the Override cluster settings section in the Databricks Asset Bundles documentation. Additionally, you can explore a code sample that illustrates how to configure integration tests by redefining resources for each target environment.

So far, everything has been defined in a single YAML file, which can make readability and management challenging as the project grows. Let’s explore some strategies to enhance flexibility, support customizations, and ensure seamless scalability as the project expands.

The include mapping allows us to specify a list of path globs that point to configuration files to include in the bundle. These path globs are relative to the location of the bundle configuration file in which they are specified.

Therefore, we can structure the project in the following form:

project/
├── tests/
│   └── ...
├── resources/
│   ├── pipelines.yml
│   ├── jobs.yml
│   └── dashboards.yml
├── src/
│   ├── notebook_a.ipynb
│   ├── notebook_b.ipynb
│   └── ...
├── databricks.yml
└── ...

And instead of using resources in a single YAML, we just specify what we want to include:

# databricks.yml

bundle:
  name: test-bundle

include:
  - resources/*.yml

targets:
  dev:
    default: true
    ...
  stg:
    ...
  prd:
    ...

Each file in resources/*.yml contains its own resource definition, allowing for a structured and modular approach to managing YAML files. This separation keeps resource definitions organized and easy to manage as the project scales. However, the include mapping can only be used at the top level, meaning all included resources will be deployed to every target.
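
For example, resources/jobs.yml could contain nothing but a resources mapping. The following is a minimal sketch that reuses the hello-job from earlier and assumes the project layout shown above (relative paths in a resource file are resolved relative to that file's location):

# resources/jobs.yml

resources:
  jobs:
    hello-job:
      name: hello-job
      job_clusters:
        - job_cluster_key: small_cluster
          new_cluster:
            spark_version: 16.4.x-scala2.12
            node_type_id: m5d.large
            num_workers: 1
      tasks:
        - task_key: hello-task
          job_cluster_key: small_cluster
          notebook_task:
            notebook_path: ../src/notebook_a.ipynb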

It is important to note that the include mapping reads all of the included files, combines their resource definitions, and merges them per target, allowing overrides as described above.

If we need to add resources for a specific target, we must define resources at the target level, as shown in the previous example. While this approach works, it can become cumbersome when managing numerous customizations across multiple targets. To address this challenge, let’s explore a better way of structuring target-specific YAML files in the next section.

 

Separating Target YAMLs

A key advantage of DAB is that the include mapping is not limited to resources; it can also be used for other top-level mappings, such as targets. This allows for a more modular and scalable project structure.

To improve organization and maintainability, we can structure our project as follows:

project/
├── tests/
│   └── ...
├── resources/
│   ├── pipelines.yml
│   ├── jobs.yml
│   └── dashboards.yml
├── src/
│   ├── notebook_a.ipynb
│   ├── notebook_b.ipynb
│   └── ...
├── targets/
│   ├── dev.yml
│   ├── stg.yml
│   └── prd.yml
├── databricks.yml
└── ...

In this setup:

  • The resources/ directory contains shared resources used by all environments.
  • The targets/ directory holds YAML files that define the specific resources for each deployment target.

Each target file, such as targets/dev.yml, includes only the resources specific to that environment.

# targets/dev.yml

targets:
  dev:
    default: true
    resources:
      ...
    ...

Similarly, targets/stg.yml and targets/prd.yml will include the appropriate resources for their respective environments.
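
For example, a hedged sketch of targets/stg.yml might add a staging-only integration test job on top of the shared resources; the job name and notebook path are hypothetical, and compute settings are omitted for brevity:

# targets/stg.yml

targets:
  stg:
    resources:
      jobs:
        integration-tests:
          name: integration-tests
          tasks:
            - task_key: run-tests
              notebook_task:
                notebook_path: ../tests/integration_tests.ipynb
              # compute settings omitted for brevity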

In databricks.yml, we can now include the common resources while allowing each target to bring its own specific configurations:

# databricks.yml

bundle:
  name: test-bundle

include:
  - resources/*.yml
  - targets/*.yml

This setup provides a structured and scalable way to manage resources, ensuring that each environment gets precisely the resources it needs without unnecessary duplication. By splitting resource definitions across separate YAML files for each target, project teams gain better organization, control, and flexibility.

This modular approach simplifies configuration management, making it easier to track changes, customize deployments, and avoid bloated YAML files.

 

Splitting Resources

The previous examples grouped multiple resources into shared YAML files. Another approach is to define resources individually and associate them directly with their target environments. This creates a clean, intuitive structure where each resource file declares the target(s) it is meant for.

Example project layout:

project/
├── tests/
│   └── ...
├── resources/
│   ├── pipeline_a.yml
│   ├── pipeline_b.yml
│   └── ...
├── src/
│   ├── notebook_a.ipynb
│   ├── notebook_b.ipynb
│   └── ...
├── databricks.yml
└── ...

In this setup, databricks.yml includes all resource files:

# databricks.yml

bundle:
  name: test-bundle

include:
  - resources/*.yml

targets:
  dev:
    ...
  stg:
    ...
  prd:
    ...

Each resource file defines a resources mapping alongside the target(s) it’s intended for. For example:

# pipeline_a.yml

anchor_name: &anchor_name
  resources:
    ...

targets:
  dev:
    <<: *anchor_name

# pipeline_b.yml

anchor_name: &anchor_name
  resources:
    ...

targets:
  dev:
    <<: *anchor_name
  stg:
    <<: *anchor_name

Here we are leveraging YAML anchors to keep the resource definition DRY and easy to read. YAML anchors are a feature that allows you to define reusable blocks of configuration, which can then be referenced elsewhere in your YAML file to avoid repetition.

With this setup, pipeline_a.yml deploys its resources only to the dev target, while pipeline_b.yml deploys to both dev and stg. The main databricks.yml defines the targets, and each included YAML file extends the corresponding section.
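
Filled in, a hedged sketch of pipeline_a.yml could look like the following; the pipeline name, notebook path, and anchor name are illustrative, and catalog and compute settings are omitted for brevity:

# pipeline_a.yml

pipeline_a_def: &pipeline_a_def
  resources:
    pipelines:
      pipeline_a:
        name: pipeline_a
        libraries:
          - notebook:
              path: ../src/notebook_a.ipynb

targets:
  dev:
    <<: *pipeline_a_def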

You can freely mix this approach with the previously described structures. Functionally, they achieve the same result. The choice depends on your preferences and the level of modularity you need. This method is especially powerful when you have many environment-specific resources and want to avoid cluttering large files with multiple unrelated configurations.

Explore the following code example to see how this approach enables the implementation of a comprehensive MLOps project, from training and validation to deployment, using a multiphase workflow with customized resources tailored for each stage.

 

Runtime Editing

DAB supports substitutions and custom variables, enabling modular, reusable, and dynamic configuration files. These features allow values to be retrieved at runtime, ensuring that resource configurations can be adjusted dynamically when deploying and running a bundle.

Variables can be assigned different values for each target, and by leveraging default values, you can implement conditional overrides for specific resource settings as needed. Please refer to the following example code to see how this logic can be implemented.
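
For instance, a hedged sketch of a custom variable with per-target values might look like this; the variable name and catalog values are hypothetical:

# databricks.yml (illustrative custom variable with per-target values)

variables:
  catalog:
    description: Unity Catalog catalog used by the bundle resources
    default: dev_catalog      # used by targets that do not override it (e.g. dev)

targets:
  prd:
    variables:
      catalog: prd_catalog    # prd deployments resolve ${var.catalog} to prd_catalog

Resource definitions can then reference the value with ${var.catalog}, and each deployment resolves it according to the selected target.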

Unfortunately, DAB does not currently support using variables to dynamically set include directives. As a workaround, we can modify the databricks.yml file within a CI/CD pipeline by substituting variables before executing the databricks bundle deploy command. This approach allows for greater flexibility in managing environment-specific configurations while maintaining automation in the deployment process.

We can enhance our project structure by leveraging environment variables to dynamically set configurations for each target, making deployments even more flexible:

project/
├── tests/
│   └── ...
├── resources/
│   ├── common/
│   │   ├── pipelines.yml
│   │   ├── jobs.yml
│   │   └── dashboards.yml
│   ├── dev/
│   │   └── ...
│   ├── stg/
│   │   └── ...
│   └── prd/
│       └── ...
├── src/
│   ├── notebook_a.ipynb
│   ├── notebook_b.ipynb
│   └── ...
├── targets/
│   ├── dev.yml
│   ├── stg.yml
│   └── prd.yml
├── databricks.yml
└── ...

Where databricks.yml looks like:

# databricks.yml

bundle:
  name: test-bundle

include:
  - resources/common/*.yml
  - resources/${target}/*.yml
  - targets/*.yml

In a CI/CD pipeline, we typically pull the DAB code from a Git repository and set an environment variable that represents the target deployment environment. Before running the databricks bundle deploy command, we can dynamically replace ${target} with the appropriate environment variable using a simple command like sed:

sed -i -e 's/${target}/'"$TARGET"'/g' databricks.yml

When running this command, every instance of ${target} in the databricks.yml file is replaced with the value of the $TARGET environment variable. This ensures that the Databricks bundle only loads the configuration and resources specific to the intended deployment environment.
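
Putting it together, a CI/CD job step might look like the following shell sketch, assuming a TARGET environment variable is provided by the pipeline (for example dev, stg, or prd):

# Illustrative CI/CD deployment step
TARGET=stg   # typically provided by the CI/CD pipeline

# Replace ${target} placeholders in databricks.yml with the actual target name
sed -i -e 's/${target}/'"$TARGET"'/g' databricks.yml

# Validate and deploy the bundle to the selected target
databricks bundle validate -t "$TARGET"
databricks bundle deploy -t "$TARGET"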

By leveraging this approach, we achieve greater flexibility, allowing each deployment to dynamically include the correct resources based on the target environment.

 

Conclusion

Scaling DAB projects and customizing deployments for multiple targets while maintaining a modular, manageable structure can be a challenge. However, with the right approach, you can streamline your workflow and reduce overhead as your projects grow.

In this article, we explored several practical techniques to help you manage increasing complexity and customization needs.

To recap:

  • Use include and resources mapping - Instead of putting everything into a single databricks.yml, spread resources across separate YAML files and use the include mapping to assemble them. This improves clarity and scalability.
  • Separating Target YAMLs - For environment-specific customization, maintain distinct YAML files for each target. This keeps configurations clean and easy to manage.
  • Splitting Resources - Define each resource individually and associate it directly with its target environments in the same YAML file. This modular approach makes updates and troubleshooting much simpler.
  • Runtime Editing - Utilize CI/CD pipelines and environment variables to dynamically adjust resource targets at deployment time, enabling flexible and automated workflows.

I hope these approaches and examples have provided you with a clearer understanding of how to structure your Databricks project using DAB. By implementing these strategies, you can achieve greater target customization while ensuring your deployment remains scalable, modular, and easy to manage.

Happy coding!
