Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

DAB - Common cluster configs possible?

RobCox
New Contributor II

I've been trying various solutions, and perhaps I'm just thinking about this the wrong way.

We're migrating over from Synapse, where we're used to having a defined set of DBX cluster profiles to run our jobs against. These are all job clusters created via the API, so they basically act as templates for us.

Now that we're moving over to asset bundles, I'm trying to work out how we can have this "common" set of clusters for each of our DAB repos to use, so there's some uniformity between them.

Something I was aiming for:

- Define all the cluster types in a single file (e.g. clusters.yml)
- Allow a per-target/task override of the default by simply providing the cluster name, e.g. "Driver_Only_DSV3"

I have this working with on-demand clusters by leveraging existing_cluster_id and parameterising it, but to use job clusters it seems you must register all the job_clusters on each job. With 10-15 cluster variants, defining those for every job in every repo is a lot of noise, and I haven't actually got a solution working using this method.
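
For reference, the on-demand version I have working is roughly this (simplified, and the variable/task names here are just illustrative):

variables:
  shared_cluster_id:
    description: "ID of the pre-created all-purpose cluster to run against"

resources:
  jobs:
    my_job:
      tasks:
        - task_key: my_task
          notebook_task:
            notebook_path: ../src/my_notebook.ipynb
          existing_cluster_id: ${var.shared_cluster_id}

targets:
  prod:
    variables:
      shared_cluster_id: "<cluster-id-for-prod>"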


2 REPLIES

saurabh18cs
Honored Contributor

Hi, you can also parametrize your job clusters:

 

job_clusters:
  - job_cluster_key: Job_cluster
    new_cluster:
      spark_version: ${var.spark_version}
      spark_conf: ${var.spark_configuration}
      azure_attributes:
        first_on_demand: 1
        availability: ON_DEMAND_AZURE
        spot_bid_max_price: -1
      node_type_id: ${var.cluster_node_type_id}
      spark_env_vars:
        PYSPARK_PYTHON: /databricks/python3/bin/python3
        LOG_LEVEL: DEBUG
        BUNDLE_ROOT_DIR: ${workspace.file_path}
      enable_elastic_disk: true
      data_security_mode: SINGLE_USER
      num_workers: ${var.cluster_worker_nodes}
      instance_pool_id: ${var.executor_instance_pool_id}
      driver_instance_pool_id: ${var.driver_instance_pool_id}
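
The ${var...} references just need a matching variables block in your databricks.yml (the defaults below are only examples, adjust to your setup):

variables:
  spark_version:
    description: "Databricks Runtime version"
    default: "15.4.x-scala2.12"
  cluster_node_type_id:
    description: "Node type for workers and driver"
    default: "Standard_DS3_v2"
  cluster_worker_nodes:
    description: "Number of worker nodes"
    default: 2

The pool IDs follow the same pattern, and each target can override any of these values under targets.<name>.variables.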

RobCox
New Contributor II

This is one option, yes, but ideally I'm looking to be able to say:

- Define cluster types once, usable by N packages that use Databricks asset bundles
- Allow the bundle to simply say: run JOB.TASK as Cluster_Type_1 for Production

The primary reason for this is that we have common tagging strategies to apply to our clusters irrespective of which asset bundle is being deployed, plus common cluster configurations / Spark conf setups.

A nice, simple "middle ground" would be a common clusters.yml that can be dropped into any bundle: if we decided to change the cluster_type_1 configuration and needed to replace it in 15 repos, it would be easy to change that one file. A rough sketch of what I mean is below.
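
Untested sketch of what I'm picturing, assuming a recent CLI with complex-variable support (I'm not sure yet whether variable definitions can live in an included file or have to sit in databricks.yml itself; names are made up):

# clusters.yml - the file we'd keep identical across repos
variables:
  cluster_type_1:
    type: complex
    description: "Standard small job cluster with our common tags"
    default:
      spark_version: "15.4.x-scala2.12"
      node_type_id: "Standard_DS3_v2"
      num_workers: 2
      custom_tags:
        cost_centre: data-platform

# databricks.yml in each repo
include:
  - clusters.yml

resources:
  jobs:
    my_job:
      job_clusters:
        - job_cluster_key: default
          new_cluster: ${var.cluster_type_1}

Production could then point the same job at a different cluster type just by overriding the variable in the prod target.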
