Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Multiples Instances of a Databricks Asset Bundle

Nmtc9to5
New Contributor II

Hi everyone.

I'm new to Databricks Asset Bundling.

I'm trying to generate a parameterized DAB template, like a class in OOP, to allow the instantiation of multiple independent Lakeflow pipelines. However, when deploying the resources, even after changing the parameters and root path, the previously deployed resources are modified instead of creating new ones.

Is there a way to create these independent pipeline instances directly using DABs, perhaps with some specific configuration? Or am I using the wrong tool?

Thanks in advance for your help.

1 ACCEPTED SOLUTION

Accepted Solutions

Kirankumarbs
Contributor

What you're running into is how DABs tracks deployments. A bundle's identity in the workspace is determined by three things: the bundle name, the target name, and the deploying user. When you redeploy with different parameters while keeping those three unchanged, DAB treats it as an update to the existing deployment, not a new one. It matches by resource keys in the state file, not by parameter values.

There are a few ways to get what you want, depending on how dynamic you need this to be.

If you know the instances in advance, the most straightforward approach is what Ale_Armillotta mentioned: define multiple pipeline resources in your YAML, each with a unique resource key. You can keep them DRY by using custom variables and YAML anchors, or by splitting them into separate resource files with include. Something like:

 
yaml
variables:
  schema_a:
    default: schema_alpha
  schema_b:
    default: schema_beta

resources:
  pipelines:
    pipeline_a:
      name: my_pipeline_alpha
      configuration:
        my_schema: ${var.schema_a}
    pipeline_b:
      name: my_pipeline_beta
      configuration:
        my_schema: ${var.schema_b}

Not as elegant as a for-loop, but it works and is fully declarative.
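For example, the shared fields can be factored out with a YAML anchor and merge key. This is a sketch: it assumes the bundle's YAML parser honors merge keys (`<<`), and the anchored fields shown here are illustrative.

yaml
resources:
  pipelines:
    # Anchor the first pipeline's shared fields so the second can reuse them.
    pipeline_a: &pipeline_defaults
      name: my_pipeline_alpha
      development: true
      configuration:
        my_schema: ${var.schema_a}
    pipeline_b:
      <<: *pipeline_defaults        # inherit shared fields from pipeline_a
      name: my_pipeline_beta        # then override what differs
      configuration:
        my_schema: ${var.schema_b}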

If you want true dynamic instantiation (you don't know ahead of time how many pipelines you need), DABs in YAML aren't really built for that. But since the recent Python support for DAB configuration, you can define resources programmatically: you write a Python file that generates resource definitions, and DABs picks them up. That gets you closer to the "class instantiation" pattern you're thinking of: loop over a list of configs and emit a pipeline resource for each one.
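The real databricks-bundles Python API has its own resource types, so treat this as an illustrative stdlib-only sketch of the pattern rather than the actual API. The helper names (make_pipeline, load_resources) and the instance/schema values are hypothetical; the point is the shape: loop over a dict of configs and emit one uniquely keyed resource per instance.

```python
def make_pipeline(instance: str, schema: str) -> dict:
    """Build one pipeline resource definition (hypothetical helper)."""
    return {
        "name": f"my_pipeline_{instance}",
        "configuration": {"my_schema": schema},
    }


def load_resources(instances: dict[str, str]) -> dict:
    """Emit a resources mapping with a unique resource key per instance."""
    return {
        "resources": {
            "pipelines": {
                f"pipeline_{instance}": make_pipeline(instance, schema)
                for instance, schema in instances.items()
            }
        }
    }


# Two independent pipeline "instances" from one template, class-instantiation style.
bundle_resources = load_resources({"alpha": "schema_alpha", "beta": "schema_beta"})
```

Because each loop iteration produces a distinct resource key (pipeline_alpha, pipeline_beta), redeploying updates each instance independently instead of overwriting a single one.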

If you want completely independent deployments from the same template (like, different teams deploying their own version), change the bundle.name per instance. That's what gives each deployment its own state. The uuid approach rvm1975 mentioned works too, but changing the bundle name is more readable and gives you a cleaner workspace layout. You can parameterize it:

 
yaml
bundle:
  name: my_pipeline_${var.instance_name}

variables:
  instance_name:
    description: "Unique name for this pipeline instance"

Then deploy with databricks bundle deploy -t dev --var="instance_name=team_a". Each unique bundle name gets its own state file, its own workspace folder, and its own set of resources. No collisions.

One thing to watch out for: allow_duplicate_names: true lets you have multiple pipelines with the same display name, but it doesn't help with DAB state tracking. Two resources with the same key in the same bundle still overwrite each other regardless of that flag.

Hope this helps! If it does, could you please mark it as "Accept as Solution"? That will help other users quickly find the correct fix.

View solution in original post

4 REPLIES 4

Ale_Armillotta
Valued Contributor II

Hi.

Asset bundles work so that every resource defined in the file gets deployed. If you change a parameter, the redeployed resource will contain only the new parameter; if you change the resource name, the old pipeline will be dropped and a new one with the new name will be created. If you want two pipelines, you have to define two pipeline resources.

In my experience, there's no way to create more pipelines for the same DAB than are defined in the asset bundle.

What I think you can do is create as many pipelines in YAML as you need, with different parameters, reusing the same files as tasks.

rvm1975
New Contributor II
bundle:
  uuid: dd05ecbc-e823-4b33-821b-335be10bae5a

To make a bundle unique, use a different uuid.
The same goes for resources, table names, etc. You can easily parameterize table names, but for the resource key ("pipeline_id") you may need a template generator or Jinja.

resources:
  pipelines:
    pipeline_id:
      name: non_uniq_name
      configuration:
        my_schema: ${var.my_schema}
        my_table_prefix: ${var.my_table_prefix}
      allow_duplicate_names: true

 


emma_s
Databricks Employee

Hey, 

As others have said, you can't really do what you're trying to do via DABs. You have to specify each object for deployment, and if you redeploy, the old objects will be overwritten. There are two potential ways you could deploy the pipelines via DABs.

1. Manually specify each pipeline individually in your DAB but use yaml anchors to avoid repetition
2. Use a script, potentially in Python, to dynamically create the YAML for all the combinations you need.

At some point in the future, Databricks will support Jobs passing parameters to pipelines, which could be a better solution to your problem. You could keep all the parameters to pass to your pipeline in a table, which the job loops through and passes to the pipeline.


I hope this helps.


Thanks,

Emma