02-26-2024 11:36 AM
Hello,
My team and I are experimenting with bundles. We follow the pattern of having one main file, Databricks.yml, with each job definition specified in a separate YAML file for modularization.
We wonder whether it is possible to select, from the main Databricks.yml, which job resources are deployed per target. Specifically, we have a job called test_<name of the application>, which contains all the unit and integration tests. Ideally, this test job would only be deployed in development alongside the rest of the resources, while in production it would be excluded.
Below is an example of the Databricks.yml.
# yaml-language-server: $schema=..\..\bundle_config_schema.json
bundle:
  name: app_name
include:
  - resources/*.yml
  - tests/test_job.yml
targets:
  dev:
    # We know this is not possible but ideally something like this would be brilliant
    # include:
    #   - resources/*.yml
    #   - tests/test_job.yml
    default: true
    variables:
      slack_web_hoook: 111111111222222222222222
      catalog: catalog_name
      storage_account_name: storege_account_name
    mode: development
    workspace:
      host: https://adb-******.azuredatabricks.net
  prod:
    # We know this is not possible but something like this would be brilliant
    # exclude:
    #   - "tests/*"
    variables:
      slack_web_hoook: 1111112222333344444
      catalog: _catalog_name
      storage_account_name: storage_account_name
    mode: production
    workspace:
      host: https://adb-***************.azuredatabricks.net
      root_path: /Shared/.bundle/prod/${bundle.name}
    run_as:
      user_name: user.user@email
Is there any alternative better than defining the whole job resource within this file?
02-26-2024 10:19 PM
Hi @Adrianj
Have you checked the Target Override feature?
https://docs.databricks.com/en/dev-tools/bundles/job-task-override.html
https://docs.databricks.com/en/dev-tools/bundles/settings.html#examples
There doesn't seem to be a direct 'exclude' option in the bundle configuration. However, by carefully specifying what to 'include', you can effectively exclude unnecessary or unwanted files.
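For reference, a minimal sketch of what a target override looks like. The job name, task key, notebook path, and cluster sizes are hypothetical and not taken from the original bundle. Settings under a target are merged into the top-level job with the same key, and tasks are matched by task_key. Note that this changes settings of a job for one target; it does not remove the job from that target.

```yaml
# Sketch only: my_job, the task key "main", and the cluster values are hypothetical.
resources:
  jobs:
    my_job:
      name: my_job
      tasks:
        - task_key: main
          notebook_task:
            notebook_path: ./src/main.py
          new_cluster:
            spark_version: 13.3.x-scala2.12
            node_type_id: Standard_DS3_v2
            num_workers: 1

targets:
  prod:
    resources:
      jobs:
        my_job:
          tasks:
            # Same task_key as above, so only num_workers is overridden in prod.
            - task_key: main
              new_cluster:
                num_workers: 4
```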
02-26-2024 11:30 PM
It does not seem to be possible to exclude a resource, but you can import configurations inside the target definition instead of in the top-level bundle configuration. This lets you deploy different resources per target, at the price of repeating include for each target config.
02-27-2024 01:30 AM
Hi Ariaa, thanks for answering. Do you have an example? When I add "include" below the target definition, it does give me an error when validating. Thanks in advance 🙂
02-27-2024 02:09 AM
Take a look at the documentation for "sync".
02-27-2024 02:44 AM
Thanks Ariaa. I have previously tried "sync". Unfortunately, as I understand it, sync only controls which files are included in or excluded from the bundle deployment, for example notebooks, wheels, or others. It treats the config YAML files as ordinary files, meaning it does not apply their configurations as part of the bundle.
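To illustrate the difference, here is a minimal sketch using the paths from the first post. It assumes, as described above, that sync filters only affect which files are uploaded, while the top-level include decides which YAML files are loaded as bundle configuration.

```yaml
# Sketch only; paths follow the layout from the first post.
include:
  - resources/*.yml
  - tests/test_job.yml   # still loaded as configuration, so the test job is still defined

sync:
  exclude:
    - tests/*.yml        # only keeps these files from being uploaded to the workspace
```

With this configuration the test job would still be part of every target, because excluding a file from sync does not remove the resources it declares.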
03-01-2024 01:05 AM
This is a very important feature to add to Databricks Asset Bundles. We have the same issue right now. Thanks for bringing it up.
03-01-2024 03:54 AM
Don't know if this is what you are trying to achieve, but during my limited testing I managed to deploy an extra notebook to a DLT pipeline to **only** stg by referencing it as an additional library:
targets:
  dev:
    default: true
    resources:
      pipelines:
        sales_pipeline:
          development: true
  stg:
    workspace:
      host: https://xxxx.x.azuredatabricks.net/ #Replace with the host address of your stg environment
    resources:
      pipelines:
        sales_pipeline:
          libraries:
            #Adding a Notebook to the DLT pipeline that tests the data
            - notebook:
                path: "./50_tests/10_integration/DLT-Pipeline-Test.py"
          development: true
  prod:
    workspace:
      host: https://xxx.x.azuredatabricks.net/ #Replace with the host address of your prod environment
    resources:
      pipelines:
        sales_pipeline:
          development: false
          #Update the cluster settings of the DLT pipeline
          clusters:
            - autoscale:
                min_workers: 1
                max_workers: 2
The actual asset is deployed to all targets, but the pipeline in which it is referenced and run is target-specific.
More info and source code here if you want to test: ossinova/databricks-asset-bundles-demo: A demo of using databricks asset bundles (github.com)
04-01-2024 04:50 AM - edited 04-01-2024 04:52 AM
Yup, I totally agree that it would be great to be able to use include/exclude at that level. But as mentioned by Ossinova, moving the contents of your 'test_job.yml' (its resources mapping) into the target mapping where that job (or more than one) should be deployed could solve your problem.
See https://docs.databricks.com/en/dev-tools/bundles/settings.html#targets: "If a target mapping specifies a workspace, artifacts, or resources mapping, and a top-level workspace, artifacts, or resources mapping also exists, then any conflicting settings are overridden by the settings within the target.".
That means the new job (or any other resource) will be appended to the existing ones, as long as you don't introduce a conflict (make sure the names are different).
Here is an example of how I added a job to run only in the test environment (and not in "dev", "staging" and "prod"):
# The name of the bundle. run `databricks bundle schema` to see the full bundle settings schema.
bundle:
  name: mlops-stacks
variables:
  experiment_name:
    description: Experiment name for the model training.
    default: /Users/${workspace.current_user.userName}/${bundle.target}-mlops-stacks-experiment
  model_name:
    description: Model name for the model training.
    default: mlops-stacks-model
  seperator:
    description: useful seperator index by PR number for test workflows. Default is nothing for other envs
    default: ""
include:
  - ./assets/*.yml
# Deployment Target specific values for workspace
targets:
  dev:
    default: true
    workspace:
      host: https://********************.databricks.com
  staging:
    workspace:
      host: https://********************.databricks.com
  prod:
    workspace:
      host: https://********************.databricks.com
  test:
    workspace:
      host: https://********************.databricks.com
      # dedicated path to deploy files for test envs by PR
      root_path: /Users/${workspace.current_user.userName}/.bundle/${bundle.name}/${bundle.target}${var.seperator}
    variables:
      # overwrite default experiment_name to have experiment by PR in test env
      # (avoids "cannot create mlflow experiment: Node named '...-experiment' already exists")
      experiment_name: /Users/${workspace.current_user.userName}/.bundle/${bundle.name}/${bundle.target}${var.seperator}/${bundle.target}-mlops-stacks-experiment
    resources:
      # additional job to be deployed in 'test' for cleaning up tests' resources
      jobs:
        resources_cleanup_job:
          name: ${bundle.target}${var.seperator}-mlops-stacks-resources-cleanup-job
          max_concurrent_runs: 1
          permissions:
            - level: CAN_VIEW
              group_name: users
          tasks:
            - task_key: resources_cleanup_job
              job_cluster_key: resources_cleanup_cluster
              notebook_task:
                notebook_path: utils/notebooks/TestResourcesCleanup.py # without ../
                base_parameters:
                  schema_full_name: test.mlops_stacks_demo
                  seperator: ${var.seperator}
                  git_source_info: url:${bundle.git.origin_url}; branch:${bundle.git.branch}; commit:${bundle.git.commit}
          job_clusters:
            - job_cluster_key: resources_cleanup_cluster
              new_cluster:
                num_workers: 3
                spark_version: 13.3.x-cpu-ml-scala2.12
                node_type_id: i3.xlarge
                custom_tags:
                  clusterSource: mlops-stack/0.2
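Applied to the original question, the same approach might look like the sketch below. This is an assumption-laden illustration, not a tested configuration: the job key test_app_name, the task key, and the notebook path are hypothetical, and compute settings are omitted for brevity. The idea is to drop tests/test_job.yml from the top-level include and declare the test job only under the dev target.

```yaml
# Sketch only: job key, task key, and notebook path are hypothetical.
bundle:
  name: app_name

include:
  - resources/*.yml   # shared jobs only; tests/test_job.yml is no longer included

targets:
  dev:
    default: true
    mode: development
    workspace:
      host: https://adb-******.azuredatabricks.net
    resources:
      jobs:
        # Declared only here, so it is deployed only to dev.
        test_app_name:
          name: test_app_name
          tasks:
            - task_key: run_tests
              notebook_task:
                notebook_path: ./tests/run_tests.py
  prod:
    mode: production
    workspace:
      host: https://adb-***************.azuredatabricks.net
```

The trade-off is the one raised in the first post: the test job definition now lives in the main file rather than in its own YAML file.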
04-09-2024 08:24 AM
Hi @Adrianj, please refer to this Medium post. I have tried to explain how you can dynamically change the content of databricks.yml for each environment while maintaining a single databricks.yml file with an adequate level of parameterization.
In your example, you could create one folder per environment and write something like the snippet below. The value of $(var.DeployEnv) can then be replaced by the Azure tokenization task, as described in bullet number 4.V.
include:
  - resources/$(var.DeployEnv)/*.yml
  - testjobs/$(var.DeployEnv)/*.yml
https://medium.com/@hrushi.medhe/databricks-asset-bundles-azure-devops-project-57453cf0e642
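As an illustration, assuming the pipeline replaces the $(var.DeployEnv) token with the literal value dev before the bundle is validated and deployed, the include block would end up looking like this. The folder names come from the snippet above; whether a given folder exists per environment is an assumption.

```yaml
# Rendered result for a dev deployment (sketch; token already replaced by the pipeline).
include:
  - resources/dev/*.yml
  - testjobs/dev/*.yml
```

For prod, the testjobs/prod folder could simply contain no job definitions, so no test job would be deployed there.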
07-02-2024 11:33 PM
Hi @Adrianj , which solution did you go for? I have 4 deployment targets so I would like to avoid having to create 4 folders with many duplicates.