This post is written by Daniel Taylor, Senior Solutions Engineer and Axel Richier, Solutions Architect.
Data engineering has historically suffered from a blunt inheritance of software engineering best practices. As it emerged in the early 2010s out of the shadow of new data-intensive systems, the industry has sought to apply the same practices uniformly to a vastly different type of engineering.
Whether that be unit & integration testing, source control, or build & release processes orchestrated through CI/CD, building many of these frameworks from scratch in the data world can be cumbersome, time-consuming & fragile. Regardless, those best practices exist for a reason and should not be ignored, but instead improved upon & adapted to suit the workload at hand, empowering engineers to view these frameworks as necessities when going into production rather than blockers or nice-to-haves.
One of the key strengths of the Databricks Data Intelligence Platform is the vast amount of DevX tooling at your disposal, helping engineers get into production & adhere to these industry-standard best practices. Today, you can use a combination of Terraform for static, core infrastructure assets (more on this later) & Databricks Asset Bundles for application code, data projects and ephemeral resources such as job clusters for your application code.
According to the docs: “Databricks Asset Bundles are a tool to facilitate the adoption of software engineering best practices, including source control, code review, testing, and continuous integration and delivery (CI/CD), for your data and AI projects.”
But what does that actually mean in practice? And what tangible, real-world examples can we demonstrate so that we can realise these benefits?
Asset Bundles are a tool that comes pre-packaged as part of the Databricks CLI, providing a framework in which engineering teams can structure source-control repositories so that their application code lives alongside orchestration pipelines (whether that be Databricks Jobs, Delta Live Tables, Databricks Apps, training & inference of ML models, etc.). These pipelines are defined declaratively using YAML (or JSON &, more recently, Python) syntax & can adhere to any type of configuration your data project requires.
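For example, a minimal sketch of a DLT pipeline definition, with a purely illustrative resource key, catalog, schema & notebook path, could look like this:

resources:
  pipelines:
    taxi_pipeline:
      name: taxi_pipeline
      catalog: main            # Unity Catalog catalog to publish into
      target: reporting        # schema for the pipeline’s output tables
      libraries:
        - notebook:
            path: ../src/taxi_pipeline.py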
Behind the scenes, Asset Bundles act as a Databricks-supported wrapper around the Databricks Terraform provider for all resources relating to application code. This lowers the barrier to entry for engineers unfamiliar with Terraform’s concepts & syntax, whilst offering additional flexibility & tooling that caters specifically to the data development lifecycle we’ve mentioned above. Engineers who are familiar with Terraform will acknowledge its lack of fluidity & quick iteration when it comes to things like unit & integration testing, as well as state management, especially across different, simultaneous development tasks on the same subset of resources.
If you want to understand the core concepts behind Databricks Asset Bundles in much more detail, including the root mapping configuration, resource-specific mappings and CLI commands, we recommend starting with the documentation. We won’t be exploring these concepts too deeply in this blog.
The true power of Asset Bundles becomes clear when we examine how they enable multiple developers to work on the same data projects (or different projects within the same repository) simultaneously.
In lower, non-controlled environments, if we were to use Terraform as our DevX tool of choice, each individual’s deployment would overwrite existing configurations (assuming a single, remote state file was used). This is extremely problematic not only when individuals are working on the same resource, but on different resources as well.
You can start to see how collaboration is bottlenecked when using Terraform as a DevX tool. As great of a tool as it is, it wasn’t really designed for this!
Asset Bundles solve this issue by isolating each individual’s state file in lower environments, whilst abstracting state-file management & maintenance away from the user. What does this actually mean in practice?
If I, engineer Alice, deployed my modified code repository from my own version-controlled feature branch to a development environment using Asset Bundles, all of my data projects would be isolated to my own state file, stored in the workspace file system under my user. Thus, if engineer Bob happened to simultaneously deploy his own modified code repository from his own feature branch to the same development environment, my changes & the data projects that exist in my branch would be completely unaffected in the workspace, thanks to that same isolation mechanism.
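Concretely, with the default settings, each engineer’s deployment (code, artifacts & Terraform state) lands under their own user directory in the workspace, following the default root path of `~/.bundle/<bundle_name>/<target>`. The bundle name & usernames below are hypothetical:

/Workspace/Users/alice@example.com/.bundle/my_data_project/dev
/Workspace/Users/bob@example.com/.bundle/my_data_project/dev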
Asset Bundles have a concept of a top-level configuration: in essence, this is just a YAML file in a project’s root (named `databricks.yml`) that is used to define all of your deployment targets as well as reusable complex & static variables.
The key mapping that we need to consider in our bundle’s root configuration in this scenario is `mode`. `mode` can take one of two default values: `development` or `production`. For lower environments, we want to set our target’s deployment mode to `development`; in turn, this deploys all resources under our bundle & isolates those resources to the user issuing the deployment from their local machine. Asset Bundles will also prepend a unique prefix to each deployed resource, so each job, DLT pipeline, ML model, etc. will not be overwritten by simultaneous deployments.
Note - the `production` deployment mode should be used in conjunction with a service principal for all other environments such as UAT & production. If `development` or `production` don’t fit your needs, it is now possible to set up custom modes.
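To make this concrete, here is a minimal sketch of a `databricks.yml` root configuration with one target per mode; the bundle name, workspace hosts & service principal ID are hypothetical placeholders:

bundle:
  name: my_data_project

targets:
  dev:
    # Per-user isolation: resources deploy under the issuing user’s
    # workspace path & resource names gain a [dev <username>] prefix
    mode: development
    default: true
    workspace:
      host: https://<your-dev-workspace>.cloud.databricks.com

  prod:
    # Single shared deployment, no per-user prefixing
    mode: production
    workspace:
      host: https://<your-prod-workspace>.cloud.databricks.com
    # Production deployments should run as a service principal
    run_as:
      service_principal_name: 00000000-0000-0000-0000-000000000000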
Let’s start to explore how seamless collaboration & independent development are with Asset Bundles. As a prerequisite, ensure that you have the Databricks CLI installed & configured for your development environment; for this walkthrough we’re using CLI version 0.277.0.
If you want to follow along, the source code that we’ll be utilising below can be found here.
We’re going to imagine the scenario described above: engineer Alice has been assigned a task that involves a change to a specific DLT pipeline based on some upstream changes. In the same stand-up, engineer Bob has been assigned a similar but distinct task that involves changing a different DLT pipeline based on some downstream reporting changes. Both DLT pipelines are defined in a version-controlled mono-repository.
Both engineers pull the latest changes from the remote repository into their local copies & branch off onto their own feature branches.
git pull origin staging
git checkout -b feature/alice # or feature/bob
Each engineer then makes local changes to their respective DLT pipeline & is now ready to deploy to a lower environment in order to perform integration tests.
Engineer A & engineer B both deploy all defined jobs from their respective feature branches, to the same lower environment & at the same time.
databricks bundle validate -t dev --profile <your_profile>
databricks bundle deploy -t dev --profile <your_profile>
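After a successful deployment, each engineer can trigger an update of their own isolated copy of the pipeline straight from the CLI. Assuming a pipeline resource key of `taxi_pipeline`, as in the earlier sketch:

databricks bundle run -t dev taxi_pipeline --profile <your_profile>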
We can now see, in our development workspace, distinct versions of each DLT pipeline, each specific to the copy in the respective engineer’s feature branch. Without Asset Bundles, working on the same assets would risk overwrites & make it impossible to run developments in parallel.
Once each developer has tested their isolated changes on their respective branches (e.g., `feature/alice` and `feature/bob`), they can prepare to merge their updates into the main branch. Each developer creates a pull request from their feature branch to the main branch. The PR serves as a proposal to merge their changes, allowing for code reviews and discussions.
Once the PR is approved, the changes from each developer’s branch are merged into the main branch, integrating their distinct updates into the shared workflow. From there, the bundle can be deployed to production, typically from a CI/CD pipeline authenticating as the service principal mentioned earlier:
databricks bundle deploy -t prod --profile <your_profile>
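As a minimal sketch, assuming GitHub Actions as the CI tool & OAuth machine-to-machine secrets configured for the service principal, such a workflow could look roughly like this:

name: deploy-prod
on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Installs the Databricks CLI
      - uses: databricks/setup-cli@main
      - run: databricks bundle deploy -t prod
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_CLIENT_ID: ${{ secrets.DATABRICKS_CLIENT_ID }}
          DATABRICKS_CLIENT_SECRET: ${{ secrets.DATABRICKS_CLIENT_SECRET }}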
This is how Databricks Asset Bundles remove development friction for data engineering teams, enabling iterative deployment while maintaining software engineering best practices. The result is a dramatically faster time-to-production cycle for data products!
By solving the collaboration challenges that typically plague data engineering teams, Asset Bundles allow you to focus on what matters: delivering valuable data pipelines and insights, rather than fighting with infrastructure and deployment conflicts.