Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Asset Bundles: Extending complex variables

Daniel_dlh
New Contributor II

Hi all,

In my Asset Bundle I have some settings for a cluster, as in the example at Substitutions and variables in Databricks Asset Bundles (section "Define a complex variable").

Now I want to add an additional attribute when using this variable, like this:

resources:
  jobs:
    my_job:
      job_clusters:
        - job_cluster_key: my_cluster_key
          new_cluster:
            <<: ${var.my_cluster}
            custom_tags:
              foo: bar
      tasks:
        - task_key: hello_task
          job_cluster_key: my_cluster_key

Unfortunately this results in the error "map merge requires map or sequence of maps as the value".

I assume the YAML merge key is processed first, while the value is still the literal string "${var.my_cluster}" (hence the error), and the variable would only be substituted afterwards.
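That merge-order hypothesis can be checked outside the Databricks CLI with any YAML parser. Here is an illustrative sketch using PyYAML (not the CLI's actual parser, which is written in Go): a merge key whose value is a real anchor works, while a merge key whose value is still a plain string fails at parse time, before any variable substitution could run.

```python
import yaml

# Merge key with a real anchor: the parser merges the mapping.
works = yaml.safe_load("""
base: &base
  spark_version: "15.4.x-scala2.12"
  num_workers: 1
cluster:
  <<: *base
  custom_tags:
    foo: bar
""")
print(works["cluster"])  # inherits spark_version and num_workers

# Merge key whose value is a plain scalar, which is exactly what
# "${var.my_cluster}" is at parse time: the parser rejects it.
merge_error = None
try:
    yaml.safe_load("""
cluster:
  <<: ${var.my_cluster}
""")
except yaml.YAMLError as e:
    merge_error = e
print(type(merge_error).__name__)
```

PyYAML's wording differs from the CLI's ("expected a mapping or list of mappings for merging" vs. "map merge requires map or sequence of maps"), but both fail for the same reason: the merge key is evaluated by the YAML parser itself.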

Have you run into a similar problem? If so, how did you solve it?

Thanks and regards!

Accepted Solution

Pat
Esteemed Contributor

Hi @Daniel_dlh ,

you can try using a YAML anchor. Have a look at this example:

Before:

# databricks.yml
variables:
  my_cluster:
    description: "Base cluster configuration"
    default:
      spark_version: "15.4.x-scala2.12"
      node_type_id: "Standard_DS3_v2"
      num_workers: 1

# my_jobs.yml
resources:
  jobs:
    sample_etl_job:
      job_clusters:
        - job_cluster_key: etl_cluster
          new_cluster:
            <<: ${var.my_cluster}  # ERROR: map merge requires map or sequence of maps
            custom_tags:
              environment: ${bundle.target}

With an anchor:

# Define base cluster as YAML anchor
definitions:
  base_cluster: &base_cluster
    spark_version: "15.4.x-scala2.12"
    node_type_id: "Standard_DS3_v2"
    num_workers: 1
    spark_conf:
      spark.databricks.cluster.profile: "serverless"
      spark.master: "local[*, 4]"

resources:
  jobs:
    sample_etl_job:
      name: sample_etl_job
      job_clusters:
        - job_cluster_key: etl_cluster
          new_cluster:
            <<: *base_cluster
            custom_tags:
              environment: ${bundle.target}
              project: dlt_telco
      tasks:
        - task_key: etl_task
          job_cluster_key: etl_cluster
          spark_python_task:
            python_file: ../src/sample_etl.py
      schedule:
        quartz_cron_expression: "0 0 1 * * ?"
        timezone_id: "UTC"
      max_concurrent_runs: 1
      timeout_seconds: 3600

Or with explicit field references:

# Serverless Jobs Configuration
resources:
  jobs:
    sample_etl_job:
      name: sample_etl_job
      job_clusters:
        - job_cluster_key: etl_cluster
          new_cluster:
            spark_version: ${var.my_cluster.spark_version}
            node_type_id: ${var.my_cluster.node_type_id}
            num_workers: ${var.my_cluster.num_workers}
            spark_conf: ${var.my_cluster.spark_conf}
            custom_tags:
              environment: ${bundle.target}
              project: dlt_telco
      tasks:
        - task_key: etl_task
          job_cluster_key: etl_cluster
          spark_python_task:
            python_file: ../src/sample_etl.py
      schedule:
        quartz_cron_expression: "0 0 1 * * ?"
        timezone_id: "UTC"
      max_concurrent_runs: 1
      timeout_seconds: 3600
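The explicit-field variant works because bundle references like ${var.my_cluster.spark_version} are substituted after the YAML has already been parsed, so they never interact with parse-time features like merge keys. The idea behind that post-parse substitution can be sketched with a toy resolver (a hypothetical helper for illustration only, not the Databricks CLI's implementation):

```python
import re

# Matches a string that is exactly a ${var.a.b.c} reference (hypothetical
# simplification: real bundle substitution also handles other schemes).
VAR_RE = re.compile(r"^\$\{var\.([\w.]+)\}$")

def resolve(node, variables):
    """Walk an already-parsed config and replace ${var...} strings
    by looking the dotted path up in the variables dict."""
    if isinstance(node, dict):
        return {k: resolve(v, variables) for k, v in node.items()}
    if isinstance(node, list):
        return [resolve(v, variables) for v in node]
    if isinstance(node, str):
        m = VAR_RE.match(node)
        if m:
            value = variables
            for part in m.group(1).split("."):
                value = value[part]
            return value
    return node

variables = {"my_cluster": {"spark_version": "15.4.x-scala2.12",
                            "num_workers": 1}}
config = {"new_cluster": {
    "spark_version": "${var.my_cluster.spark_version}",
    "num_workers": "${var.my_cluster.num_workers}",
    "custom_tags": {"foo": "bar"},
}}
resolved = resolve(config, variables)
print(resolved["new_cluster"]["spark_version"])
```

Because substitution happens on the parsed tree, each field can pull a single value out of the complex variable without ever needing a YAML-level merge.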

The key concept: YAML anchors only work within a single file, so if you want to share an anchor across multiple jobs, you must put the anchor definition AND all the jobs that use it in the SAME file.
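The single-file caveat follows directly from how YAML works: anchors are resolved by the parser within one document, so an alias pointing at an anchor defined in a different file is simply undefined. A quick illustration with PyYAML (illustrative only; the Databricks CLI uses its own YAML parser):

```python
import yaml

# File 1 defines the anchor; file 2 tries to alias it.
file1 = """
definitions:
  base_cluster: &base_cluster
    spark_version: "15.4.x-scala2.12"
"""
file2 = """
new_cluster:
  <<: *base_cluster
"""

parsed1 = yaml.safe_load(file1)  # parses fine; the anchor lives and dies here

alias_error = None
try:
    yaml.safe_load(file2)  # the anchor from file1 is not visible here
except yaml.YAMLError as e:
    alias_error = e
print(type(alias_error).__name__)  # an "undefined alias" parse error
```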
