
How to use variable-overrides.json for environment-specific configuration in Asset Bundles?

esistfred
New Contributor III

Hi all,

Could someone clarify the intended usage of the variable-overrides.json file in Databricks Asset Bundles?

Let me give some context. Let's say my repository layout looks like this:

databricks/
├── notebooks/
│   └── notebook.ipynb
├── resources/
│   └── job.yml
└── databricks.yml

My job.yml looks somewhat like this:

  jobs:
    databricks_job:
      name: databricks_job
      max_concurrent_runs: 1

      schedule:
        quartz_cron_expression: "0 */5 * * * ?"
        timezone_id: UTC
        pause_status: ${var.pause_status}

      tasks:
        - task_key: notebook_task
          job_cluster_key: job_cluster
          notebook_task:
            notebook_path: ../notebooks/notebook.ipynb

      job_clusters:
        - job_cluster_key: job_cluster
          new_cluster:
            spark_version: 15.4.x-scala2.12
            node_type_id: ${var.node_type_id}
            data_security_mode: SINGLE_USER
            autoscale:
              min_workers: ${var.min_workers}
              max_workers: ${var.max_workers}

      parameters:
        - name: parameter_key
          default: ${var.parameter_value}

And my databricks.yml looks somewhat like this:

bundle:
  name: databricks_jobs

include:
  - resources/*.yml
  - resources/*/*.yml

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://dev-workspace.azuredatabricks.net

  prod:
    mode: production
    workspace:
      host: https://prod-workspace.azuredatabricks.net

I'm deploying via an Azure DevOps pipeline using the Databricks CLI: databricks bundle deploy --target ${{ parameters.environment }}

In reality, my setup includes multiple environments, more jobs and more parameters—such as different Storage Account names, cluster configurations, etc. I’d prefer not to overload the databricks.yml with all of these environment-specific variables.

Instead, I came across the variable-overrides.json file, which seems like a promising alternative. However, the documentation simply states: "You can also define a complex variable in the .databricks/bundle/<target>/variable-overrides.json file [...]"

Here’s where I’m stuck:

  • The .databricks/ directory is excluded by .gitignore and seems to be generated only at runtime.
  • Since I’m not deploying locally but via a DevOps pipeline, I’m unsure how to provide or inject these variable-overrides.json files into the .databricks/bundle/<target>/ directory during deployment.
  • What’s the recommended workflow for using variable-overrides.json in a CI/CD setup like Azure DevOps?

Any insights, best practices, or examples would be much appreciated!

Thanks in advance!

3 REPLIES

-werners-
Esteemed Contributor III

What I did was create an additional YAML file containing all this global config, which is used in all bundles.
E.g. in a folder /common or /global you can define a 'globalconf.yml'.
In this file you define your global variables (in a variables section, like in databricks.yml), and you can even define your targets here (except for the workspace URL; that was not permitted last time I checked).

Now, this file you include into your databricks.yml using the include section:

include:
- ../common/globalconf.yml
- resources/*.yml

databricks.yml is now able to read the content of globalconf.yml.
We still have to pass this to the resources dir (job definitions). This can be done by defining variables in databricks.yml which are filled with values from globalconf.yml.
(The files that reside in /resources cannot access globalconf.yml directly; only databricks.yml can.)
Like this you can put a ton of config into a global file, for example as sketched below.
Not sure if it makes sense.
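For illustration, a minimal globalconf.yml along these lines might look like this (the variable names and values are hypothetical, not from the post above):

# common/globalconf.yml - shared config included from each bundle's databricks.yml
variables:
  storage_account:
    description: Storage account used by the jobs
    default: mystorageaccount   # hypothetical value
  default_node_type:
    description: Default node type for job clusters
    default: Standard_D3_v2     # hypothetical value

A databricks.yml that includes this file can then reference these values via ${var.storage_account} and pass them on to the job definitions under /resources.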

 

FedeRaimondi
Contributor

I believe this page is the most relevant one to clarify your doubts: Substitutions and variables in Databricks Asset Bundles | Databricks Documentation.

I will try to adapt it to your use case, where I guess you are already adding your bundle variables in databricks.yml.

bundle:
  name: databricks_jobs

include:
  - resources/*.yml
  - resources/*/*.yml

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://dev-workspace.azuredatabricks.net
    variables:
      your_variable: value-dev

  prod:
    mode: production
    workspace:
      host: https://prod-workspace.azuredatabricks.net
    variables:
      your_variable: value-prod

variables:
  your_variable:
    description: Description.
    default: default-value

Here you can define all your variables, and you need to specify them for each target. Then, as you did already, you access them in your resources via `${var.your_variable}`.

When using bundles in your CI/CD tool, you have a few options to overwrite those variables:

  1. Give the variable value in the Databricks bundle CLI command:
databricks bundle validate --var="your_variable=new-value"
  2. Set an environment variable:
export BUNDLE_VAR_your_variable=new-value
  3. Use a .databricks/bundle/<target>/variable-overrides.json file with content:
{"your_variable": "new-value"}

Then you can always create complex variables (a variable with subfields), and method 3 allows you to overwrite them.
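For example (a minimal sketch; the variable name and fields are illustrative, not taken from this thread), a complex variable declared in databricks.yml:

variables:
  job_cluster:
    description: Cluster settings per environment
    type: complex
    default:
      spark_version: 15.4.x-scala2.12
      num_workers: 1

could then be overwritten for a given target via .databricks/bundle/<target>/variable-overrides.json:

{
  "job_cluster": {
    "spark_version": "15.4.x-scala2.12",
    "num_workers": 4
  }
}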

Be aware that there is a priority order among these methods when they are combined, so check the documentation linked above for which one takes precedence!

If you need to use the 3rd approach in your DevOps pipeline, then make sure to create the file if it doesn't exist, for example with a bash script step:

# Create the target directory if it doesn't exist, then write the overrides file
mkdir -p .databricks/bundle/dev
echo '{ "your_variable": "new-value" }' > .databricks/bundle/dev/variable-overrides.json

To conclude here is my view on variables:

I personally create a bunch of custom variables in my bundle, and usually there are some that I won't change at run time but that do change based on the target environment, for example catalog, schema and others. These I keep in each target definition.

Then I have Azure DevOps variable libraries (one per target environment) in which I can safely store authentication settings (host, client_id, client_secret). I then manage these with DevOps stages and pick the one I need based on the deploy or release strategy.

Finally, I have very few variables I want to modify when I run databricks bundle deploy, so I use method 1 described above; for instance, I pass a git_sha for traceability, which could be used as a tag for a job or as a parameter in your entrypoint.
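As a sketch (this assumes a git_sha variable is declared in the bundle's variables section; $(Build.SourceVersion) is the predefined Azure DevOps variable holding the commit SHA):

databricks bundle deploy --target prod --var="git_sha=$(Build.SourceVersion)"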

esistfred
New Contributor III (Accepted Solution)

It does. Thanks for the response. I also continued playing around with it and found a way using the variable-overrides.json file. I'll leave it here in case anyone is interested:

Repository layout:

databricks/
├── notebooks/
│   └── notebook.ipynb
├── resources/
│   └── job.yml
├── variables/
│   ├── dev/
│   │   └── variable-overrides.json
│   └── prod/
│       └── variable-overrides.json
└── databricks.yml

 The job.yml looks like this:

  jobs:
    databricks_job:
      name: databricks_job
      max_concurrent_runs: 1
      
      schedule:
        quartz_cron_expression: "0 */5 * * * ?"
        timezone_id: UTC
        pause_status: ${var.pause_status}

      tasks:
        - task_key: notebook_task
          job_cluster_key: job_cluster
          notebook_task:
            notebook_path: ../notebooks/notebook.ipynb

      job_clusters:
        - job_cluster_key: job_cluster
          new_cluster: ${var.job_cluster}

      parameters: ${var.parameters}

The databricks.yml looks like this:

bundle:
  name: databricks_jobs

include:
  - resources/*.yml
  - resources/*/*.yml

variables:
  pause_status:
    description: Pause status of the job
    type: string
    default: ""  # All declared variables are required to have a default value
  job_cluster:
    description: Configuration for the job cluster
    type: complex
    default: {}  # of type map
  parameters:
    description: Parameters for the job
    type: complex
    default: []  # of type sequence

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://dev-workspace.azuredatabricks.net

  prod:
    mode: production
    workspace:
      host: https://prod-workspace.azuredatabricks.net

A variable-overrides.json would look like this:

{
    "pause_status": "PAUSED",
    "job_cluster": {
        "spark_version": "15.4.x-scala2.12",
        "node_type_id": "Standard_D3_v2",
        "data_security_mode": "SINGLE_USER",
        "autoscale": {
            "min_workers": 1,
            "max_workers": 4
        }
    },
    "parameters": [
        {
            "name": "parameter_key",
            "default": "parameter_value"
        }
    ]
}

The Azure DevOps deployment pipeline has the following layout (a sketch of the pipeline YAML follows below):

  • Bash@3: Install the Databricks CLI (if not already installed on the agent)
  • AzureCLI@2: Run databricks bundle validate --target ${{ parameters.environment }} (this step creates the .databricks/bundle/<target> directory on the agent)
  • CopyFiles@2: Copy the variables/${{ parameters.environment }}/variable-overrides.json file into the .databricks/bundle/<target> directory
  • AzureCLI@2: Run databricks bundle deploy --target ${{ parameters.environment }}
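A minimal sketch of those steps as pipeline YAML (the service connection name my-azure-connection is hypothetical, and the configuration that authenticates the CLI against the workspace is omitted; adapt both to your setup):

parameters:
  - name: environment
    type: string
    default: dev

steps:
  # Install the Databricks CLI using the official install script
  - task: Bash@3
    displayName: Install Databricks CLI
    inputs:
      targetType: inline
      script: curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh

  # Validate the bundle; this creates .databricks/bundle/<target> on the agent
  - task: AzureCLI@2
    displayName: Validate bundle
    inputs:
      azureSubscription: my-azure-connection   # hypothetical service connection
      scriptType: bash
      scriptLocation: inlineScript
      inlineScript: databricks bundle validate --target ${{ parameters.environment }}

  # Inject the environment-specific overrides file
  - task: CopyFiles@2
    displayName: Copy variable-overrides.json
    inputs:
      SourceFolder: variables/${{ parameters.environment }}
      Contents: variable-overrides.json
      TargetFolder: .databricks/bundle/${{ parameters.environment }}

  # Deploy using the injected overrides
  - task: AzureCLI@2
    displayName: Deploy bundle
    inputs:
      azureSubscription: my-azure-connection
      scriptType: bash
      scriptLocation: inlineScript
      inlineScript: databricks bundle deploy --target ${{ parameters.environment }}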

This approach works well for me and lets you maintain environment-specific variables in dedicated JSON files. Common configuration can be maintained either directly in the job.yml or, if it is shared across multiple jobs, in the variables section of the databricks.yml.