Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

DAB bundle deploy --force-lock creates duplicate jobs after Azure DevOps pipeline failure

bhawuk_arora
New Contributor

Hi everyone,

We're experiencing an issue with Declarative Automation Bundles (DAB) while deploying to our Azure Databricks Production workspace through an Azure DevOps pipeline.

Environment

  • Azure Databricks

  • Declarative Automation Bundles (DAB)

  • Azure DevOps Pipeline

  • Authentication using a Service Principal

  • Deployment target: prod

  • Deployment lock enabled

Issue

During deployment, the Azure DevOps pipeline fails midway after successfully deploying some resources. As a result:

  • A subset of jobs gets created successfully in the Databricks workspace.

  • The deployment remains in a locked state.

  • To recover, we rerun the deployment using:

databricks bundle deploy -t prod --force-lock

However, instead of updating the existing jobs that were already deployed, Databricks creates new duplicate jobs with different Job IDs.

Expected Behavior

I would expect the subsequent deployment to:

  • Reuse the existing deployment state.

  • Update the already deployed jobs.

  • Continue deploying the remaining resources.

  • Avoid creating duplicate jobs.

Current Behavior

Each retry with --force-lock creates duplicate jobs rather than updating the existing ones.

Questions

  1. Is this expected behavior when using --force-lock after a partially successful deployment?

  2. Is there a recommended recovery process after a deployment fails midway?

  3. Is there a way to resume deployment without creating duplicate resources?

  4. Does this indicate that the deployment state (Terraform/DAB state) is being recreated or lost?

  5. Is there any recommended approach for Azure DevOps pipelines to prevent this scenario?

Any guidance or best practices would be greatly appreciated.

Thank you!

1 REPLY

balajij8
Contributor III

Hi,

  1. Is this expected behavior when using --force-lock after a partially successful deployment? - Yes, this is expected when the deployment state becomes disconnected from the actual workspace resources. DAB correlates bundle resources with workspace objects through the IDs recorded in its state file. If the state file does not contain the IDs of the partially deployed jobs, DAB treats them as new resources and creates them again.

  2. Is there a recommended recovery process after a deployment fails midway? - You can use the bundle deployment bind command to manually link existing resources back to your bundle. It updates the state file to recognize the existing resource, which prevents duplicates in the next deployment.

  3. Is there a way to resume deployment without creating duplicate resources? - You can manually reconnect the partially deployed resources to the bundle state: identify which jobs were created, bind each existing job to its resource key in your bundle, and then deploy normally. Alternatively, delete the partially deployed jobs from the UI and deploy again. In the commands below, jobname, resourcekey, and jobid are placeholders for your own values.

databricks jobs list --output JSON | grep "jobname"

databricks bundle deployment bind resourcekey jobid -t prod

databricks bundle deploy -t prod
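
If several jobs were left behind, the bind step can be wrapped in a small loop. This is only a sketch under assumptions: bind_jobs is a hypothetical helper name, the resource keys and job IDs in the usage comment are made-up placeholders, and it assumes your CLI version accepts --auto-approve on bundle deployment bind so the command does not wait for an interactive confirmation on a CI agent.

```shell
#!/usr/bin/env bash
# Sketch: re-bind jobs left behind by a failed deploy.
# bind_jobs is a hypothetical helper, not part of the Databricks CLI.

# Usage: bind_jobs TARGET KEY=ID [KEY=ID ...]
bind_jobs() {
  local target="$1"
  shift
  local pair key id
  for pair in "$@"; do
    key="${pair%%=*}"   # bundle resource key (text left of the first '=')
    id="${pair#*=}"     # existing workspace job ID (text right of the first '=')
    # Assumption: --auto-approve is available and skips the confirmation prompt.
    databricks bundle deployment bind "$key" "$id" -t "$target" --auto-approve || return 1
  done
}

# Example call (placeholder keys and IDs), followed by a normal deploy:
# bind_jobs prod ingest_job=111 report_job=222
# databricks bundle deploy -t prod
```

The loop stops at the first failed bind, so a typo in a key or ID surfaces before the deploy runs.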

4. Does this indicate that the deployment state (Terraform/DAB state) is being recreated or lost? - Yes. The state is being lost or corrupted during the failed deployment. The next deployment then runs with the old state, which doesn't include the partially deployed resources. DAB sees them as missing from the state and creates them again.

5. Is there any recommended approach for Azure DevOps pipelines to prevent this scenario? - Consider the following:

  • Use an explicit root_path in production.
  • State file backup: consider backing up the state file from the workspace, if possible, before deploying.
  • Run bundle validate before deployment to catch configuration issues.
  • Consider --fail-on-active-runs so the deployment fails early if jobs or pipelines in the bundle are still running.
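
The points above could be combined into one pipeline step roughly as follows. Treat this as a sketch rather than a verified recipe: guarded_deploy is a hypothetical helper, the paths in the usage comment are placeholders, and the backup step assumes that the bundle state sits in a state folder under the bundle's root_path and that workspace export-dir can export it, which you should confirm in your own workspace.

```shell
#!/usr/bin/env bash
# Sketch of a guarded production deploy step for a CI agent.
# guarded_deploy is a hypothetical helper, not part of the Databricks CLI.

# Usage: guarded_deploy TARGET ROOT_PATH BACKUP_DIR
guarded_deploy() {
  local target="$1" root_path="$2" backup_dir="$3"

  # 1. Catch configuration errors before anything is created.
  databricks bundle validate -t "$target" || return 1

  # 2. Best-effort state backup. Assumption: state is stored under
  #    <root_path>/state. A failure here (for example on the very first
  #    deploy, when no state exists yet) is reported but not fatal.
  databricks workspace export-dir "$root_path/state" "$backup_dir" \
    || echo "state backup skipped" >&2

  # 3. Deploy, refusing to proceed while jobs or pipelines are running.
  databricks bundle deploy -t "$target" --fail-on-active-runs
}

# Example call (placeholder paths):
# guarded_deploy prod /Workspace/bundles/my_bundle/prod ./state-backup
```

Keeping the backup non-fatal is a design choice: a missing state folder should not block a first deployment, while a failed validate should.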