
Databricks Bundles - How to select which job resources to deploy per target?

Adrianj
New Contributor III

Hello, 

My team and I are experimenting with bundles. We follow the pattern of having one main file, Databricks.yml, with each job definition specified in a separate YAML file for modularization.

We wonder whether it is possible to select, from the main Databricks.yml, which job resources are deployed per target. Specifically, we have a job called test_<name of the application>, which contains all the unit and integration tests. Ideally, this test job would be deployed only in development, alongside the rest of the resources, while in production it would be excluded.

Below is an example of the Databricks.yml.

# yaml-language-server: $schema=..\..\bundle_config_schema.json
bundle:
  name: app_name

include:
  - resources/*.yml
  - tests/test_job.yml

targets:
  dev:
#  We know this is not possible but ideally something like this would be brilliant
#      include:
#        - resources/*.yml
#        - tests/test_job.yml
    default: true
    variables:
      slack_web_hoook: 111111111222222222222222
      catalog: catalog_name
      storage_account_name: storege_account_name
    mode: development
    workspace:
      host: https://adb-******.azuredatabricks.net

  prod:
# We know this is not possible but something like this would be brilliant
#     exclude:
#       - "tests/*"
    variables: 
      slack_web_hoook: 1111112222333344444
      catalog: _catalog_name
      storage_account_name: storage_account_name
    mode: production
    workspace:
      host: https://adb-***************.azuredatabricks.net
      root_path: /Shared/.bundle/prod/${bundle.name}
    run_as:
      user_name: user.user@email

Is there any alternative better than defining the whole job resource within this file?

10 REPLIES

AlliaKhosla
New Contributor III

Hi @Adrianj 

Have you checked the Target Override feature?

https://docs.databricks.com/en/dev-tools/bundles/job-task-override.html

https://docs.databricks.com/en/dev-tools/bundles/settings.html#examples

There doesn't seem to be a direct 'exclude' option in the bundle configuration. However, by carefully specifying what to 'include', you can effectively exclude unnecessary or unwanted files.
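
For reference, a target override along the lines of the first linked page might look like the sketch below. The job name, task key, notebook path, and cluster settings are hypothetical and are not taken from the original post. Note that an override like this changes the settings of a job that exists in every target; it does not remove the job from a target.

```yaml
# Sketch only: app_job, main, the notebook path, and the cluster values are hypothetical.
resources:
  jobs:
    app_job:
      name: app_job
      tasks:
        - task_key: main
          notebook_task:
            notebook_path: ./src/main.py
          new_cluster:
            spark_version: 13.3.x-scala2.12
            node_type_id: Standard_DS3_v2
            num_workers: 1

targets:
  prod:
    resources:
      jobs:
        app_job:
          tasks:
            # Joined with the top-level task that has the same task_key;
            # only the settings listed here are overridden in prod.
            - task_key: main
              new_cluster:
                num_workers: 4
```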

 

Ariaa
New Contributor II

It does not seem to be possible to exclude a resource, but what you can do is import configurations inside the target definition instead of in the top-level bundle configuration. This lets you deploy different resources per target, at the price of repeating the include for each target config.

Adrianj
New Contributor III

Hi Ariaa, thanks for answering. Do you have an example? When I add "include" below the target definition, it does give me an error when validating. Thanks in advance 🙂 

Ariaa
New Contributor II

Take a look at the documentation for "sync".

Adrianj
New Contributor III

Thanks Ariaa. I have previously tried "sync". Unfortunately, as far as I understand, sync only controls which files are included in or excluded from the bundle deployment, for example notebooks, wheels, and so on. It treats the config YAML files as ordinary files, meaning it does not apply their configurations as part of the bundle.
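
To illustrate the distinction, here is a minimal sketch of a per-target sync setting. It assumes the sync mapping is accepted under a target as well as at the top level, and the tests/ path is only an example.

```yaml
# Sketch only: assumes sync is allowed under a target; the path is illustrative.
targets:
  prod:
    sync:
      exclude:
        # Stops files under tests/ from being uploaded to the workspace in prod.
        - tests/*
    # This only filters uploaded files. As described above, a job declared in
    # tests/test_job.yml is still part of the bundle as long as that file is
    # listed under the top-level include.
```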

BerkerKozan
New Contributor III

This is a very important feature to add to Data Asset Bundles. We have the same issue right now. Thanks for bringing it up.

ossinova
Contributor II

I don't know if this is what you are trying to achieve, but during my limited testing I managed to add an extra notebook to a DLT pipeline in stg **only**, by referencing it as an additional library:

targets:
  dev:
    default: true
    resources:
      pipelines:
        sales_pipeline:
          development: true

  stg:
    workspace:
      host: https://xxxx.x.azuredatabricks.net/ #Replace with the host address of your stg environment
    resources:
      pipelines:
        sales_pipeline:
          libraries:
          #Adding a Notebook to the DLT pipeline that tests the data
          - notebook:
              path: "./50_tests/10_integration/DLT-Pipeline-Test.py"
          development: true

  prod:
    workspace:
      host: https://xxx.x.azuredatabricks.net/ #Replace with the host address of your prod environment
    resources:
      pipelines:
        sales_pipeline:
          development: false
          #Update the cluster settings of the DLT pipeline
          clusters:
            - autoscale:
                min_workers: 1
                max_workers: 2

The actual asset is deployed to all targets, but the pipeline in which it is referenced and run is target-specific.

More info and source code here if you want to test: ossinova/databricks-asset-bundles-demo: A demo of using databricks asset bundles (github.com)

M_smile
New Contributor II

Yup, totally agree with you: it would be great to be able to use include/exclude at that level. In the meantime, as mentioned by ossinova, moving the contents of your 'test_job.yml' (its resources mapping) into the mapping of the target that should get that job (or jobs) could solve your problem.
See https://docs.databricks.com/en/dev-tools/bundles/settings.html#targets: "If a target mapping specifies a workspace, artifacts, or resources mapping, and a top-level workspace, artifacts, or resources mapping also exists, then any conflicting settings are overridden by the settings within the target."
That means the new job (or other resource) will be appended to the existing ones in that target, as long as you don't introduce a conflict (make sure the names are different).

Here is an example of how I added a job that is deployed only in the test environment (and not in "dev", "staging" or "prod"):

# The name of the bundle. run `databricks bundle schema` to see the full bundle settings schema.
bundle:
  name: mlops-stacks

variables:
  experiment_name:
    description: Experiment name for the model training.
    default: /Users/${workspace.current_user.userName}/${bundle.target}-mlops-stacks-experiment
  model_name:
    description: Model name for the model training.
    default: mlops-stacks-model
  seperator:
    description: useful seperator index by PR number for test workflows. Default is nothing for other envs
    default: ""

include:
  - ./assets/*.yml

# Deployment Target specific values for workspace
targets:
  dev:
    default: true
    workspace:
      host: https://********************.databricks.com

  staging:
    workspace:
      host: https://********************.databricks.com

  prod:
    workspace:
      host: https://********************.databricks.com

  test:
    workspace:
      host: https://********************.databricks.com
      # dedicated path to deploy files for test envs by PR
      root_path: /Users/${workspace.current_user.userName}/.bundle/${bundle.name}/${bundle.target}${var.seperator}
    variables:
      # overwrite default experiment_name to have experiment by PR in test env 
      # (avoids "cannot create mlflow experiment: Node named '...-experiment' already exists")
      experiment_name: /Users/${workspace.current_user.userName}/.bundle/${bundle.name}/${bundle.target}${var.seperator}/${bundle.target}-mlops-stacks-experiment
    resources:
      # additional job to be deployed in 'test' for cleaning up tests' resources 
      jobs:
        resources_cleanup_job:
          name: ${bundle.target}${var.seperator}-mlops-stacks-resources-cleanup-job
          
          max_concurrent_runs: 1

          permissions:
            - level: CAN_VIEW
              group_name: users

          tasks:
            - task_key: resources_cleanup_job
              job_cluster_key: resources_cleanup_cluster
              notebook_task:
                notebook_path: utils/notebooks/TestResourcesCleanup.py  # without ../
                base_parameters:
                  schema_full_name: test.mlops_stacks_demo
                  seperator: ${var.seperator}
                  git_source_info: url:${bundle.git.origin_url}; branch:${bundle.git.branch}; commit:${bundle.git.commit}

          job_clusters:
            - job_cluster_key: resources_cleanup_cluster
              new_cluster:
                num_workers: 3
                spark_version: 13.3.x-cpu-ml-scala2.12
                node_type_id: i3.xlarge
                custom_tags:
                  clusterSource: mlops-stack/0.2

 

 


 

HrushiM
New Contributor II

Hi @Adrianj, please refer to the Medium post linked below. In it I have tried to explain how you can dynamically change the content of databricks.yml for each environment while maintaining a single databricks.yml file with an adequate level of parameterization.

In your example, you could create one folder per environment and write something like the snippet below. The value of $(var.DeployEnv) can then be replaced by the Azure tokenization task, as described in bullet number 4.V of the post.

include:
  - resources/$(var.DeployEnv)/*.yml
  - testjobs/$(var.DeployEnv)/*.yml

https://medium.com/@hrushi.medhe/databricks-asset-bundles-azure-devops-project-57453cf0e642
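
As a rough illustration, and assuming the tokenization step substitutes the value "dev" for $(var.DeployEnv), the file that the bundle CLI actually reads for a dev deployment would contain:

```yaml
# databricks.yml after token replacement for a dev deployment
# (sketch; "dev" is an assumed value for DeployEnv)
include:
  - resources/dev/*.yml
  - testjobs/dev/*.yml
```

For a prod deployment the same lines would point at resources/prod/ and testjobs/prod/, so the test job is left out of prod simply by not placing its definition in the prod folder.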

thibault
Contributor II

Hi @Adrianj , which solution did you go for? I have 4 deployment targets so I would like to avoid having to create 4 folders with many duplicates.
