Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

CI/CD - Databricks Asset Bundles - Deploy/destroy only bundles with changes after Merge Request

achntrl
New Contributor

Hello everyone,

We're in the process of migrating to Databricks and are encountering challenges implementing CI/CD using Databricks Asset Bundles. Our monorepo houses multiple independent bundles within a "dabs" directory, with only one team member working on a specific bundle at a time.

We've adopted a single-branch strategy with an "environment-per-folder" approach. Each bundle has a "databricks.yml" at its root and one or more environment folders (e.g., "uat/", "prd/"). Each target in "databricks.yml" points to the appropriate environment folder. This setup enables easy feature testing in isolation and granular control per environment.
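
For illustration, the layout looks roughly like this (the bundle names here are made up):

dabs/
├── bundle_a/
│   ├── databricks.yml
│   ├── uat/
│   └── prd/
└── bundle_b/
    ├── databricks.yml
    ├── uat/
    └── prd/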

Our goal is to efficiently determine which bundles require deployment/destruction after a Merge Request is merged into the "main" branch. We aim to trigger child pipelines only for bundles with actual changes in the Merge Request.

We've successfully identified changed bundles during Merge Request pipelines (i.e. when a Merge Request is opened) using git diff to compare the target and source branches:

 

- git fetch origin
- changes=$(git diff --name-status origin/$CI_MERGE_REQUEST_TARGET_BRANCH_NAME origin/$CI_COMMIT_REF_NAME -- dabs/)

This first pipeline lets us verify that the changed bundles are valid using databricks bundle validate.
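
As a simplified sketch, the Merge Request job could look like this in GitLab CI (the job name and image are illustrative; any image with git and the Databricks CLI would do):

validate-changed-bundles:
  image: ghcr.io/databricks/cli:latest  # illustrative image choice
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
  script:
    - git fetch origin "$CI_MERGE_REQUEST_TARGET_BRANCH_NAME"
    - changes=$(git diff --name-only "origin/$CI_MERGE_REQUEST_TARGET_BRANCH_NAME...HEAD" -- dabs/)
    # bundle name = second path component under dabs/
    - bundles=$(echo "$changes" | cut -d/ -f2 | sort -u)
    - |
      for b in $bundles; do
        (cd "dabs/$b" && databricks bundle validate)
      done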

However, replicating this logic in the pipeline triggered after the Merge Request is merged (a branch pipeline) is proving difficult. We need to accurately identify the same changed bundles for deployment or destruction (i.e. databricks bundle deploy -t {env} or databricks bundle destroy -t {env}).

We've thought about two potential solutions:

  • Comparing commits: Determining relevant commits from the Merge Request and comparing them to the "main" branch. This seems complex due to potential squashing and concurrent Merge Requests (see the sketch after this list).
  • Using artifacts: Storing changed bundles as an artifact during the Merge Request pipeline and retrieving/using this artifact in the subsequent pipeline. This approach might be complex due to potential naming conflicts.
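
One sketch of the first option, assuming each merge to "main" lands as a single merge or squash commit: in the branch pipeline, diffing the new commit against its first parent should isolate exactly what that Merge Request introduced (each merge commit's first-parent diff covers only its own Merge Request, even with squashing):

# assumes the clone is deep enough that the parent commit exists (GIT_DEPTH > 1)
- changes=$(git diff --name-status "$CI_COMMIT_SHA^1" "$CI_COMMIT_SHA" -- dabs/)
# bundle name = second path component under dabs/ (last field handles renames)
- bundles=$(echo "$changes" | awk '{print $NF}' | cut -d/ -f2 | sort -u)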

Is there a more efficient way to identify changed bundles after a Merge Request is merged? We could simply redeploy all the bundles, but that wouldn't cover destruction: we need to destroy bundles in certain environments (e.g., uat) to manage costs and keep the UI clean, as we'll have lots of jobs.
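
For the destruction side, we've sketched one possibility (hypothetical, building on the first-parent diff above): treat a deleted "databricks.yml" as the signal that a bundle was removed, and run the destroy from the pre-merge tree, since the bundle's files no longer exist at HEAD:

- deleted=$(git diff --name-status "$CI_COMMIT_SHA^1" "$CI_COMMIT_SHA" -- dabs/ | awk '$1 == "D" && $2 ~ /\/databricks.yml$/ {print $2}' | cut -d/ -f2)
# check out the pre-merge tree so the deleted bundle's config is still available
- git worktree add /tmp/pre-merge "$CI_COMMIT_SHA^1"
- |
  for b in $deleted; do
    (cd "/tmp/pre-merge/dabs/$b" && databricks bundle destroy -t uat --auto-approve)
  done
- git worktree remove /tmp/pre-merge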

We'd appreciate any advice or alternative approaches to tackle this challenge.

Thanks in advance for your help!

1 REPLY

Kaniz_Fatma
Community Manager

Hi @achntrl, a few approaches could help here: using Git tags to track repository states, tightening the git diff logic to identify changes more precisely, using a CI/CD tool with strong artifact management (such as GitLab CI/CD or GitHub Actions), implementing automated cleanup scripts for environments like UAT, and leveraging the Databricks CLI and API to automate bundle management. An example GitHub Actions workflow would include steps to check out the code, set up the Databricks CLI, identify the changed bundles, and deploy or destroy them accordingly. For further guidance, the Databricks CI/CD documentation is a helpful resource.
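
A minimal sketch of such a workflow, assuming the monorepo layout described above (the trigger, secrets, and detection logic are illustrative rather than an official template):

name: deploy-changed-bundles
on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    env:
      DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
      DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # full history so the merge commit's first parent is available
      - uses: databricks/setup-cli@main  # official action that installs the Databricks CLI
      - name: Deploy or destroy changed bundles
        run: |
          # First-parent diff of the merge (or squash) commit = this MR's changes
          bundles=$(git diff --name-only "${GITHUB_SHA}^1" "$GITHUB_SHA" -- dabs/ | cut -d/ -f2 | sort -u)
          for b in $bundles; do
            if [ -f "dabs/$b/databricks.yml" ]; then
              (cd "dabs/$b" && databricks bundle deploy -t uat)
            else
              # Bundle was deleted in this merge; destroy it from the pre-merge tree
              git worktree add /tmp/pre-merge "${GITHUB_SHA}^1"
              (cd "/tmp/pre-merge/dabs/$b" && databricks bundle destroy -t uat --auto-approve)
              git worktree remove /tmp/pre-merge
            fi
          done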
