Databricks Community

achntrl · ‎08-01-2024

Hello everyone,

We're in the process of migrating to Databricks and are encountering challenges implementing CI/CD using Databricks Asset Bundles. Our monorepo houses multiple independent bundles within a "dabs" directory, with only one team member working on a specific bundle at a time.

We've adopted a single-branch strategy with an "environment-per-folder" approach. Each bundle has a "databricks.yml" at its root and one or more environment folders (e.g., "uat/", "prd/"). Each target in "databricks.yml" points to the appropriate environment folder. This setup enables easy feature testing in isolation and granular environment control.

Our goal is to efficiently determine which bundles require deployment/destruction after a Merge Request is merged into the "main" branch. We aim to trigger child pipelines only for bundles with actual changes in the Merge Request.

We've successfully identified changed bundles during Merge Request pipelines (i.e. when a Merge Request is opened) using git diff to compare the target and source branches:

- git fetch
- changes=$(git diff --name-status origin/$CI_MERGE_REQUEST_TARGET_BRANCH_NAME origin/$CI_COMMIT_REF_NAME -- dabs/)

This first pipeline lets us verify that the changed bundles are valid using databricks bundle validate.

However, replicating this logic for the pipeline triggered after the Merge Request is merged (which is a branch pipeline) is proving difficult. We need to accurately identify the same changed bundles for deployment/destruction (i.e. databricks bundle deploy -t {env}/databricks bundle destroy -t {env}).

We've thought about two potential solutions:

Comparing commits: Determining relevant commits from the Merge Request and comparing them to the "main" branch. This seems complex due to potential squashing and concurrent Merge Requests.
Using artifacts: Storing changed bundles as an artifact during the Merge Request pipeline and retrieving/using this artifact in the subsequent pipeline. This approach might be complex due to potential naming conflicts.

Is there a more efficient way to identify changed bundles after a Merge Request is merged? We could simply re-deploy all the bundles but what about destruction of bundles? We have the need for bundle destruction in certain environments (e.g., uat) to manage costs and UI cleanliness as we'll have lots of jobs.

We'd appreciate any advice or alternative approaches to tackle this challenge.

Thanks in advance for your help!

mark_ott · 3 weeks ago

Your challenge—reliably determining the subset of changed Databricks Asset Bundles after a Merge Request (MR) is merged into main for focused deploy/destroy CI/CD actions—is common in complex monorepo, multi-environment setups. Let’s break down the problem and present practical solutions tailored to Databricks Bundle and monorepo-specific workflows.

Key Challenges

Branch pipeline context loss: After an MR merges, the direct comparison context (source↔target) is unavailable. Only the "main" branch is present, making it hard to know which bundles changed, especially with squashed or rebased merges.
Efficiency: Deploying/destroying all bundles is wasteful; targeted actions are needed.
Clean-up (destruction): To avoid resource waste in dynamic environments, you must reliably detect which bundles should be destroyed after an MR merge.

Practical Strategies

1. Artifact and Bundle Tracking (Recommended)

Store a manifest of changed bundles as a pipeline artifact during the MR pipeline, and then consume this manifest in the main branch (post-merge) pipeline.

How it works:
- During MR pipeline: Use git diff to determine changed bundle folders, save to a manifest file (e.g., changed_bundles.txt), and upload as an MR pipeline artifact.
- Main branch pipeline: Download the artifact from the most recent MR pipeline(s) associated with that commit (using your CI system’s API or environment variables). Use this manifest to restrict deployment/destruction actions.
Advantages:
- Precise, works through squash/rebase merges.
- Minimizes unnecessary deploy/destroy actions.
- Scales to multiple contributors and concurrent MRs.
Potential caveats:
- Artifact retrieval scripts must handle edge cases with multiple concurrent pipelines and merges.
- Both GitLab and GitHub Actions can store job artifacts and make them available for download through their APIs, but you must map the post-merge pipeline back to the MR pipeline that produced the manifest.

2. Commit Message and Metadata Parsing

Require the MR workflow to attach metadata to the merge commit on main (e.g., via the commit message, a tag, or an update to a tracked file) listing the changed bundles. Your main branch pipeline can parse this metadata.

How it works: For example, append something like BUNDLES_CHANGED: dabs/foo, dabs/bar to the merge commit. Use CI scripting to parse the commit message for bundle names.
Caveats: Less robust if contributors bypass conventions or push directly.
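A minimal sketch of the parsing step, assuming the marker sits on its own line in the commit message and that bundles are comma-separated as in the example above. The function name parse_bundles_changed is hypothetical, not part of any tool:

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a commit message on stdin and print the bundle
# paths listed after a "BUNDLES_CHANGED:" marker, one per line, de-duplicated.
# The marker name and comma-separated format are assumptions taken from the
# example convention above.
parse_bundles_changed() {
  sed -n 's/^BUNDLES_CHANGED:[[:space:]]*//p' \
    | tr ',' '\n' \
    | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d' \
    | sort -u
}

# Possible use in the branch pipeline (left commented out so that sourcing
# this file has no side effects):
# git log -1 --format=%B "$CI_COMMIT_SHA" | parse_bundles_changed
```

If the marker is missing the function prints nothing, so the pipeline should decide explicitly what an empty result means (skip, or fall back to another strategy).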

3. Re-Query the Git History (with Limits)

In the main branch pipeline, use git log + git diff to compare the last merge commit(s) or the last N commits on main. This is imperfect with large/parallel changes, but works if merges are sequential.

Example:

git fetch origin
git diff --name-status HEAD~1 HEAD -- dabs/

(Replace HEAD~1 with the correct ancestor commit based on your CI/CD provider’s merge mechanism.)
Limitations: Won’t account for overlapping changes from multiple merged MRs if using squash/rebase.
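On GitLab specifically, one way to pick the ancestor is the predefined variable CI_COMMIT_BEFORE_SHA, which in a branch pipeline holds the previous tip of the branch. The sketch below assumes the dabs/<bundle>/... layout from the question and that the base commit is available in the runner's clone (shallow clones may need a deeper fetch). The function names are hypothetical:

```shell
#!/usr/bin/env bash
# Reduce `git diff --name-status` output (stdin) to unique bundle names,
# assuming the layout dabs/<bundle>/... described in the question.
bundles_from_name_status() {
  # Fields are tab-separated. Renames carry two paths (old and new), so every
  # field after the status letter is treated as a path.
  awk -F'\t' '{ for (i = 2; i <= NF; i++) print $i }' \
    | awk -F/ '$1 == "dabs" && NF >= 3 { print $2 }' \
    | sort -u
}

# Diff the range that was just pushed to the branch. CI_COMMIT_BEFORE_SHA is
# all zeros for the first pipeline on a branch, so fall back to the parent.
changed_bundles_in_push() {
  local base="${CI_COMMIT_BEFORE_SHA:-}" head="${CI_COMMIT_SHA:-HEAD}"
  case "$base" in
    ''|0000000000000000000000000000000000000000) base="${head}~1" ;;
  esac
  git diff --name-status "$base" "$head" -- dabs/ | bundles_from_name_status
}
```

Because the range covers everything between the old and new branch tip, this should also cover squash merges, but it still treats several MRs merged in one push as a single change set.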

Handling Destruction of Bundles

Destruction is harder: it requires knowledge of both changed and deleted bundles (e.g., folders removed in the MR).

Store full bundle folder inventory as a file on the main branch (e.g., all_bundles.txt), regenerate it in each pipeline, and track deletions via diff.
Deploy logic: If a bundle was present in the previous main pipeline but no longer exists, trigger destruction for that environment.
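The inventory comparison could look roughly like this. It assumes a bundle is any directory under dabs/ that contains a databricks.yml, and that both inventory files are sorted the same way. The names list_bundles and removed_bundles are hypothetical:

```shell
#!/usr/bin/env bash
# Print the name of every bundle directory under a root (default: dabs),
# one per line, sorted. A bundle is assumed to be a directory that contains
# a databricks.yml at its top level.
list_bundles() {
  local root="${1:-dabs}" dir
  for dir in "$root"/*/; do
    if [ -f "${dir}databricks.yml" ]; then
      basename "$dir"
    fi
  done | LC_ALL=C sort
}

# Print bundles that appear in the previous inventory but not in the current
# one. comm requires both inputs to be sorted in the same collation order.
removed_bundles() {
  local previous="$1" current="$2"
  LC_ALL=C comm -23 "$previous" "$current"
}

# Possible use (commented out so that sourcing this file has no side effects):
# list_bundles dabs > all_bundles.txt
# removed_bundles previous_all_bundles.txt all_bundles.txt
```

One thing to keep in mind: databricks bundle destroy needs the bundle's configuration, so for a bundle that was deleted from the repo you would likely have to check out its folder from the previous commit before destroying it.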

Recommendations

Use artifact-based detection: This is standard in larger monorepo CI/CD and is robust to squash merges and rebases. Artifacts can be passed with careful management and CI API scripting.
Implement a “bundle index”: Keep a simple manifest (all_bundles.txt) and compare it between runs to detect removals.
Consider automating bundle meta-tracking: If your CI allows storing variables per pipeline or commit, you can automate bundle change tracking and clean-ups.

Example: Artifact-Based Approach (GitLab, Pseudocode)

MR Pipeline:

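# Collect the bundle name (second path component) of every changed file.
# Renames list two paths, so all path fields are printed.
git diff --name-status "origin/$CI_MERGE_REQUEST_TARGET_BRANCH_NAME" "origin/$CI_COMMIT_REF_NAME" -- dabs/ | \
awk -F'\t' '{for (i = 2; i <= NF; i++) print $i}' | awk -F/ '{print $2}' | sort -u > changed_bundles.txt
# Save changed_bundles.txt as an artifact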

Main Pipeline:

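Use the CI system’s API (e.g., GitLab’s REST API, together with predefined variables such as CI_COMMIT_SHA) to find the latest MR pipeline associated with the merged commit and retrieve changed_bundles.txt. Note that CI_PIPELINE_SOURCE only tells you how the current pipeline was triggered; it does not identify the MR pipeline.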
Use this list to trigger deployments/destructions.
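Once the manifest is available, the split between deploy and destroy can be sketched as a dry-run planner. This assumes changed_bundles.txt holds one bundle name per line and that a bundle whose databricks.yml no longer exists in the checkout should be destroyed. The function name plan_bundle_actions is hypothetical:

```shell
#!/usr/bin/env bash
# Print one planned action per bundle named in a manifest file.
# A bundle whose databricks.yml still exists under the root is marked
# "deploy"; any other bundle is marked "destroy". Nothing is executed here.
plan_bundle_actions() {
  local manifest="$1" root="${2:-dabs}" bundle
  while IFS= read -r bundle || [ -n "$bundle" ]; do
    # Skip blank lines in the manifest
    [ -n "$bundle" ] || continue
    if [ -f "$root/$bundle/databricks.yml" ]; then
      printf 'deploy %s\n' "$bundle"
    else
      printf 'destroy %s\n' "$bundle"
    fi
  done < "$manifest"
}

# Possible use (commented out; "uat" is only an example target name):
# plan_bundle_actions changed_bundles.txt | while read -r action bundle; do
#   if [ "$action" = deploy ]; then
#     (cd "dabs/$bundle" && databricks bundle deploy -t uat)
#   fi
# done
```

Keeping the planning step separate from execution makes it easy to log or review the plan in the pipeline before any child pipeline is triggered.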

Avoiding Deploy-All

Never deploy/destroy all unless necessary. This is inefficient and risks unintended side effects.
For deleted bundles: Compare the previous and current list of bundles and destroy only those removed from the repo.