07-14-2025 01:48 AM
Hi all,
Could someone clarify the intended usage of the variable-overrides.json file in Databricks Asset Bundles?
Let me give some context. Let's say my repository layout looks like this:
databricks/
├── notebooks/
│   └── notebook.ipynb
├── resources/
│   └── job.yml
└── databricks.yml
My job.yml looks somewhat like this:
jobs:
  databricks_job:
    name: databricks_job
    max_concurrent_runs: 1
    schedule:
      quartz_cron_expression: "0 */5 * * * ?"
      timezone_id: UTC
      pause_status: ${var.pause_status}
    tasks:
      - task_key: notebook_task
        job_cluster_key: job_cluster
        notebook_task:
          notebook_path: ../notebooks/notebook.ipynb
    job_clusters:
      - job_cluster_key: job_cluster
        new_cluster:
          spark_version: 15.4.x-scala2.12
          node_type_id: ${var.node_type_id}
          data_security_mode: SINGLE_USER
          autoscale:
            min_workers: ${var.min_workers}
            max_workers: ${var.max_workers}
    parameters:
      - name: parameter_key
        default: ${var.parameter_value}
And my databricks.yml looks somewhat like this:
bundle:
  name: databricks_jobs

include:
  - resources/*.yml
  - resources/*/*.yml

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://dev-workspace.azuredatabricks.net
  prod:
    mode: production
    workspace:
      host: https://prod-workspace.azuredatabricks.net
I'm deploying via an Azure DevOps pipeline using the Databricks CLI: databricks bundle deploy --target ${{ parameters.environment }}
In reality, my setup includes multiple environments, more jobs and more parameters, such as different Storage Account names, cluster configurations, etc. I'd prefer not to overload the databricks.yml with all of these environment-specific variables.
Instead, I came across the variable-overrides.json file, which seems like a promising alternative. However, the documentation simply states: "You can also define a complex variable in the .databricks/bundle/<target>/variable-overrides.json file [...]"
Here's where I'm stuck:
Any insights, best practices, or examples would be much appreciated!
Thanks in advance!
07-14-2025 02:43 AM
What I did was create an additional YAML file containing all this global config, which is used in all bundles.
For example, in a folder /common or /global you can define a 'globalconf.yml'.
In this file you define your global variables (in the variables section, like in databricks.yml), and you can even define your targets here (except for the workspace URL; that is not permitted, last time I checked).
Now, you include this file into your databricks.yml using the include section.
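For example, assuming globalconf.yml lives under a common/ folder (the folder name is just an illustration), the include section in databricks.yml could look like this:
include:
  - common/globalconf.yml
  - resources/*.yml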
databricks.yml is now able to read the content of globalconf.yml.
We still have to pass this to the resources dir (the job definition). This can be done by defining variables in databricks.yml that are filled with the values from globalconf.yml
(the files that reside in /resources cannot access the globalconf.yml file directly, only databricks.yml can).
This way you can put a ton of config into a global file.
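A rough sketch of what such a globalconf.yml might contain (storage_account is just a made-up example variable):
variables:
  storage_account:
    description: Name of the shared storage account
    default: mystorageaccount
Resource files can then reference it as ${var.storage_account}, since the variable definition gets merged into the bundle configuration once databricks.yml includes the file.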
Not sure if it makes sense.
07-14-2025 06:47 AM
I believe this page is the most helpful for clarifying your doubts: Substitutions and variables in Databricks Asset Bundles | Databricks Documentation.
I will try to adapt it to your use case, where I guess you are already adding your bundle variables in databricks.yml.
bundle:
  name: databricks_jobs

include:
  - resources/*.yml
  - resources/*/*.yml

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://dev-workspace.azuredatabricks.net
    variables:
      your_variable: value-dev
  prod:
    mode: production
    workspace:
      host: https://prod-workspace.azuredatabricks.net
    variables:
      your_variable: value-prod

variables:
  your_variable:
    description: Description.
    default: default-value
Here you can define all your variables and then specify a value for them in each target. Then, as you did already, you access them in your resources with `${var.your_variable}`.
When using bundles in your CI/CD tool, you have a few options to override those variables:
1. Pass the value on the command line: databricks bundle validate --var="your_variable=new-value"
2. Set an environment variable: export BUNDLE_VAR_your_variable=new-value
3. Put the value in the .databricks/bundle/<target>/variable-overrides.json file: {"your_variable": "new-value"}
Then you can always create complex variables (a variable with subfields), and method 3 allows you to override them.
Be aware that there is an order of precedence between these methods (if I remember correctly, the command-line flag wins over the environment variable, which wins over the overrides file)!
If you need to use the 3rd approach in your DevOps pipeline, then make sure to create the file if it doesn't exist, for example with a bash script step:
mkdir -p .databricks/bundle/dev
echo '{ "your_variable": "new-value" }' > .databricks/bundle/dev/variable-overrides.json
To conclude, here is my view on variables:
I personally create a bunch of custom variables in my bundle. Usually there are some that I won't change at run time, but they do change based on the target environment, for example catalog, schema and others. These I keep in each target definition.
Then I have Azure DevOps libraries (one per target environment) in which I can safely store authentication details (host, client_id, client_secret). These I then manage with DevOps stages, picking the one I need based on your deploy or release strategy.
Finally, I have very few variables I want to modify when I run `databricks bundle deploy`, so I use method 1 described above; for instance, I pass a git_sha for traceability, which can be used as a tag on a job or as a parameter in your entrypoint.
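A minimal sketch of that last point, assuming git_sha is declared as a bundle variable and used as a job tag ($(Build.SourceVersion) is the Azure DevOps predefined variable holding the commit SHA):
# In databricks.yml
variables:
  git_sha:
    description: Git commit SHA of the deployed bundle
    default: unknown
# In the job definition
jobs:
  databricks_job:
    tags:
      git_sha: ${var.git_sha}
# In the pipeline (method 1)
databricks bundle deploy --target prod --var="git_sha=$(Build.SourceVersion)"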
07-14-2025 06:54 AM
It does. Thanks for the response. I also continued playing around with it and found a way using the variable-overrides.json file. I'll leave it here just in case anyone is interested:
Repository layout:
databricks/
├── notebooks/
│   └── notebook.ipynb
├── resources/
│   └── job.yml
├── variables/
│   ├── dev/
│   │   └── variable-overrides.json
│   └── prod/
│       └── variable-overrides.json
└── databricks.yml
The job.yml looks like this:
jobs:
  databricks_job:
    name: databricks_job
    max_concurrent_runs: 1
    schedule:
      quartz_cron_expression: "0 */5 * * * ?"
      timezone_id: UTC
      pause_status: ${var.pause_status}
    tasks:
      - task_key: notebook_task
        job_cluster_key: job_cluster
        notebook_task:
          notebook_path: ../notebooks/notebook.ipynb
    job_clusters:
      - job_cluster_key: job_cluster
        new_cluster: ${var.job_cluster}
    parameters: ${var.parameters}
The databricks.yml looks like this:
bundle:
  name: databricks_jobs

include:
  - resources/*.yml
  - resources/*/*.yml

variables:
  pause_status:
    description: Pause status of the job
    type: string
    default: ""  # All declared variables are required to have a default value
  job_cluster:
    description: Configuration for the job cluster
    type: complex
    default: {}  # of type map
  parameters:
    description: Parameters for the job
    type: complex
    default: []  # of type sequence

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://dev-workspace.azuredatabricks.net
  prod:
    mode: production
    workspace:
      host: https://prod-workspace.azuredatabricks.net
A variable-overrides.json would look like this:
{
  "pause_status": "PAUSED",
  "job_cluster": {
    "spark_version": "15.4.x-scala2.12",
    "node_type_id": "Standard_D3_v2",
    "data_security_mode": "SINGLE_USER",
    "autoscale": {
      "min_workers": 1,
      "max_workers": 4
    }
  },
  "parameters": [
    {
      "name": "parameter_key",
      "default": "parameter_value"
    }
  ]
}
The Azure DevOps deployment pipeline has the following layout:
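A rough sketch of the relevant steps (assuming a pipeline parameter named environment, a Linux agent with the Databricks CLI already installed and authenticated, and the databricks/ folder shown above as the bundle root):
steps:
  - script: |
      mkdir -p .databricks/bundle/${{ parameters.environment }}
      cp variables/${{ parameters.environment }}/variable-overrides.json \
         .databricks/bundle/${{ parameters.environment }}/variable-overrides.json
    workingDirectory: databricks
    displayName: Copy variable overrides for the target
  - script: databricks bundle deploy --target ${{ parameters.environment }}
    workingDirectory: databricks
    displayName: Deploy bundle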
This approach works well for me and lets you maintain environment-specific variables in dedicated JSON files. Common configuration can be maintained either directly in job.yml or, if it's shared across multiple jobs, in the variables section of databricks.yml.