Databricks Asset Bundles (DAB) is a structured way to define, deploy, and manage Databricks workflows, including jobs, clusters, dashboards, model serving endpoints, and other resources, using declarative YAML configurations. By treating infrastructure and data assets as code, DAB enables software engineering best practices such as version control, CI/CD integration, and automation. It also simplifies collaboration and deployment across environments, making Databricks projects easier to manage and scale.
If you are completely new to DAB, please refer to What are Databricks Asset Bundles before continuing with this blog post.
By using DAB, you can define Databricks resources like jobs, pipelines, and notebooks as source files. These files fully describe a project and the code that governs it, providing project structure and automation for testing and deployment. Additionally, DAB allows us to define deployment targets that can be fully customized depending on the use case and the project’s needs, such as development, staging, and production.
In some cases, specific environments may require additional resources. For example, staging might include extra testing pipelines that are not needed in production once validation is complete.
In this blog post, we will explore different ways of customizing resource deployments per target.
As we know, Databricks resources are defined by specifying the type of resource and its configuration under the resources mapping in databricks.yml. Resources can include Databricks apps, clusters, dashboards, jobs, pipelines, model serving endpoints, and more. The following example illustrates how to define a simple job and a job cluster:
resources:
  jobs:
    hello-job:
      name: hello-job
      max_concurrent_runs: 1
      job_clusters:
        - job_cluster_key: small_cluster
          new_cluster:
            spark_version: 16.4.x-scala2.12
            node_type_id: m5d.large
            num_workers: 1
      tasks:
        - task_key: hello-task
          notebook_task:
            notebook_path: ./hello.py
By default, resources can be declared at the top level, making them available across all deployment targets:
# databricks.yml
bundle:
  name: test-bundle

resources:
  ...

targets:
  dev:
    default: true
    ...
  stg:
    ...
  prd:
    ...
With this setup, the same resources are deployed to dev, stg, and prd, ensuring consistency across all environments.
In many cases, different environments require unique configurations. For example, you might need additional testing pipelines in staging but not in production. To achieve this, you can define resources specific to each deployment target:
# databricks.yml
bundle:
  name: test-bundle

resources:
  ...

targets:
  dev:
    default: true
    resources:
      ...
    ...
  stg:
    resources:
      ...
    ...
  prd:
    resources:
      ...
    ...
By defining resources at the target level, additional resources are deployed only where needed, while still inheriting the global resources from the top level. This flexible approach ensures that each environment is optimized for its purpose without unnecessary configurations.
Be aware that each resource has an identifier. If you use the same identifier at both the top level and the target level, the target-level definition will take precedence and override the top-level definition.
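As a hedged sketch of this override behavior, reusing the hello-job and small_cluster identifiers from the earlier example, the following configuration keeps one worker everywhere except prd, where the same job identifier is redefined with a larger cluster. The exact values are illustrative:
# databricks.yml
bundle:
  name: test-bundle

resources:
  jobs:
    hello-job:
      name: hello-job
      job_clusters:
        - job_cluster_key: small_cluster
          new_cluster:
            spark_version: 16.4.x-scala2.12
            node_type_id: m5d.large
            num_workers: 1
      tasks:
        - task_key: hello-task
          notebook_task:
            notebook_path: ./hello.py

targets:
  prd:
    resources:
      jobs:
        hello-job:                      # same identifier as the top-level job
          job_clusters:
            - job_cluster_key: small_cluster
              new_cluster:
                spark_version: 16.4.x-scala2.12
                node_type_id: m5d.large
                num_workers: 4          # prd-specific value takes precedence over the top-level setting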
For detailed guidance on customizing configurations for specific targets, refer to the Override cluster settings section in the Databricks Asset Bundles documentation. Additionally, explore a code sample that illustrates how to configure integration tests by redefining resources for each target environment.
So far, everything has been defined in a single YAML file, which can make readability and management challenging as the project grows. Let’s explore some strategies to enhance flexibility, support customizations, and ensure seamless scalability as the project expands.
The include mapping allows us to add a list of path globs that point to configuration files to include within the bundle. These path globs are relative to the location of the bundle configuration file in which they are specified.
Therefore, we can structure the project in the following form:
project/
├── tests/
│ └── ...
├── resources/
│ ├── pipelines.yml
│ ├── jobs.yml
│ └── dashboards.yml
├── src/
│ ├── notebook_a.ipynb
│ ├── notebook_b.ipynb
│ └── ...
├── databricks.yml
└── ...
Instead of defining all resources in a single YAML file, we simply specify which files to include:
# databricks.yml
bundle:
  name: test-bundle

include:
  - resources/*.yml

targets:
  dev:
    default: true
    ...
  stg:
    ...
  prd:
    ...
Each file in resources/*.yml contains its own resource definition, allowing for a structured and modular approach to managing YAML files. This separation keeps resource definitions organized and easy to manage as the project scales. However, the include mapping can only be used at the top level, meaning all included resources will be deployed to every target.
The important point to note is that the include mapping reads all of the included files, combines their resource definitions, and merges them per target, allowing overrides as described above.
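For example, one of the included files, resources/jobs.yml, might look like the following minimal sketch (the job name is illustrative; the notebook path is relative to the file in which it is declared):
# resources/jobs.yml
resources:
  jobs:
    nightly-refresh:
      name: nightly-refresh
      tasks:
        - task_key: refresh
          notebook_task:
            notebook_path: ../src/notebook_a.ipynb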
If we need to add resources for a specific target, we must define resources at the target level, as shown in the previous example. While this approach works, it can become cumbersome when managing numerous customizations across multiple targets. To address this challenge, let’s explore a better way of structuring target-specific YAML files in the next section.
A key advantage of DAB is that the include directive is not limited to resources; it can also be used for other top-level keys such as the targets mapping. This allows for a more modular and scalable project structure.
To improve organization and maintainability, we can structure our project as follows:
project/
├── tests/
│ └── ...
├── resources/
│ ├── pipelines.yml
│ ├── jobs.yml
│ └── dashboards.yml
├── src/
│ ├── notebook_a.ipynb
│ ├── notebook_b.ipynb
│ └── ...
├── targets/
│ ├── dev.yml
│ ├── stg.yml
│ └── prd.yml
├── databricks.yml
└── ...
In this setup, each target file, such as targets/dev.yml, includes only the resources specific to that environment:
# targets/dev.yml
targets:
  dev:
    default: true
    resources:
      ...
    ...
Similarly, targets/stg.yml and targets/prd.yml will include the appropriate resources for their respective environments.
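As a hedged sketch, a staging target file could add an integration-test job that exists only in that environment (the job name and notebook path below are illustrative):
# targets/stg.yml
targets:
  stg:
    resources:
      jobs:
        integration-tests:
          name: integration-tests
          tasks:
            - task_key: run-tests
              notebook_task:
                notebook_path: ../tests/integration.ipynb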
In databricks.yml, we can now include the common resources while allowing each target to bring its own specific configurations:
# databricks.yml
bundle:
name: test-bundle
include:
- resources/*.yml
- targets/*.yml
This setup provides a structured and scalable way to manage resources, ensuring that each environment gets precisely the resources it needs without unnecessary duplication. By splitting resource definitions across separate YAML files for each target, project teams gain better organization, control, and flexibility.
This modular approach simplifies configuration management, making it easier to track changes, customize deployments, and avoid bloated YAML files.
The previous examples grouped multiple resources into shared YAML files. Another approach is to define resources individually and associate them directly with their target environments. This creates a clean, intuitive structure where each resource lives next to the target it’s meant for.
Example project layout:
project/
├── tests/
│ └── ...
├── resources/
│ ├── pipeline_a.yml
│ ├── pipeline_b.yml
│ └── ...
├── src/
│ ├── notebook_a.ipynb
│ ├── notebook_b.ipynb
│ └── ...
├── databricks.yml
└── ...
In this setup, databricks.yml includes all resource files:
# databricks.yml
bundle:
  name: test-bundle

include:
  - resources/*.yml

targets:
  dev:
    ...
  stg:
    ...
  prd:
    ...
Each resource file defines a resources mapping alongside the target(s) it’s intended for. For example:
# pipeline_a.yml
anchor_name: &anchor_name
  resources:
    ...

targets:
  dev:
    <<: *anchor_name

# pipeline_b.yml
anchor_name: &anchor_name
  resources:
    ...

targets:
  dev:
    <<: *anchor_name
  stg:
    <<: *anchor_name
Here we are leveraging YAML anchors to keep the resource definition DRY and easy to read. YAML anchors are a feature that allows you to define reusable blocks of configuration, which can then be referenced elsewhere in your YAML file to avoid repetition.
pipeline_a.yml deploys its resources only to the dev environment, while pipeline_b.yml deploys to both dev and stg. The main databricks.yml defines the targets, and each included YAML file extends the corresponding section.
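For instance, a hedged sketch of a complete pipeline_b.yml could look like this (the anchor name, pipeline name, and notebook path are illustrative, and the path is relative to the resources/ folder):
# pipeline_b.yml
pipeline_b_resources: &pipeline_b_resources
  resources:
    pipelines:
      pipeline_b:
        name: pipeline_b
        libraries:
          - notebook:
              path: ../src/notebook_b.ipynb

targets:
  dev:
    <<: *pipeline_b_resources
  stg:
    <<: *pipeline_b_resources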
You can freely mix this approach with the previously described structures. Functionally, they achieve the same result. The choice depends on your preferences and the level of modularity you need. This method is especially powerful when you have many environment-specific resources and want to avoid cluttering large files with multiple unrelated configurations.
Explore the following code example to see how this approach enables the implementation of a comprehensive MLOps project, from training and validation to deployment, using a multiphase workflow with customized resources tailored for each stage.
DAB supports substitutions and custom variables, enabling modular, reusable, and dynamic configuration files. These features allow values to be retrieved at runtime, ensuring that resource configurations can be adjusted dynamically when deploying and running a bundle.
Variables can be assigned different values for each target, and by leveraging default values, you can implement conditional overrides for specific resource settings as needed. Please refer to the following example code to see how this logic can be implemented.
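As a minimal sketch of this pattern (node_type below is an assumed variable name), a variable can carry a default value, be overridden per target, and be referenced elsewhere with the ${var.<name>} substitution:
# databricks.yml
variables:
  node_type:
    description: Cluster node type used by job clusters
    default: m5d.large

targets:
  prd:
    variables:
      node_type: m5d.2xlarge

# referenced inside a job cluster definition, for example:
# node_type_id: ${var.node_type}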
Unfortunately, DAB does not currently support using variables to dynamically set include directives. As a workaround, we can modify the databricks.yml file within a CI/CD pipeline by substituting variables before executing the databricks bundle deploy command. This approach allows for greater flexibility in managing environment-specific configurations while maintaining automation in the deployment process.
We can enhance our project structure by leveraging environment variables to dynamically set configurations for each target, making deployments even more flexible:
project/
├── tests/
│ └── ...
├── resources/
│ ├── common/
│ │ ├── pipelines.yml
│ │ ├── jobs.yml
│ │ └── dashboards.yml
│ ├── dev/
│ │ └── ...
│ ├── stg/
│ │ └── ...
│ └── prd/
│   └── ...
├── src/
│ ├── notebook_a.ipynb
│ ├── notebook_b.ipynb
│ └── ...
├── targets/
│ ├── dev.yml
│ ├── stg.yml
│ └── prd.yml
├── databricks.yml
└── ...
Where databricks.yml looks like:
# databricks.yml
bundle:
  name: test-bundle

include:
  - resources/common/*.yml
  - resources/${target}/*.yml
  - targets/*.yml
In a CI/CD pipeline, we typically pull the DAB code from a Git repository and set an environment variable that represents the target deployment environment. Before running the databricks bundle deploy command, we can dynamically replace ${target} with the appropriate environment variable using a simple command like sed:
sed -i -e 's/${target}/'"$TARGET"'/g' databricks.yml
When running this command, every instance of ${target} in the databricks.yml file is replaced with the value of the $TARGET environment variable. This ensures that the Databricks bundle only loads the configuration and resources specific to the intended deployment environment.
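Put together, a CI/CD deployment step might look roughly like this sketch (the TARGET value would normally come from the pipeline configuration rather than being hardcoded):
# Select the deployment target for this pipeline run
export TARGET=stg

# Replace ${target} in databricks.yml so only the matching resource files are included
sed -i -e 's/${target}/'"$TARGET"'/g' databricks.yml

# Validate and deploy the bundle to the selected target
databricks bundle validate -t "$TARGET"
databricks bundle deploy -t "$TARGET"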
By leveraging this approach, we achieve greater flexibility, allowing each deployment to dynamically include the correct resources based on the target environment.
Scaling DAB projects and customizing deployments for multiple targets while maintaining a modular, manageable structure can be a challenge. However, with the right approach, you can streamline your workflow and reduce overhead as your projects grow.
In this article, we explored several practical techniques to help you manage increasing complexity and customization needs.
To recap:
- Define shared resources at the top level and add or override resources at the target level when an environment needs something different.
- Use the include mapping to split resource definitions into separate YAML files for a modular project structure.
- Move target definitions into their own files (for example, targets/dev.yml) so each environment carries only the resources it needs.
- Define resources in individual files and use YAML anchors to attach each resource to one or more targets.
- Combine substitutions, custom variables, and CI/CD-driven replacement of include paths to make deployments fully dynamic.
I hope these approaches and examples have provided you with a clearer understanding of how to structure your Databricks project using DAB. By implementing these strategies, you can achieve greater target customization while ensuring your deployment remains scalable, modular, and easy to manage.
Happy coding!