3 weeks ago
Hi there, running into some trouble abstracting job_clusters configurations to improve reusability. At the moment, I have many job yaml files that require the following configuration:
What would be the best approach(es) to remove this configuration from every job yaml file? Currently, we already have the following kinds of yaml files:
I did try creating a new `cluster_definitions.yml` as follows:
# Centralized cluster definitions for all fleet jobs.
# YAML anchors define reusable cluster profiles; each job references them via merge keys.
# DAB deep-merges these job_clusters with the tasks/parameters in individual fleet_*.yml files.
x-cluster-base: &cluster_base
  spark_version: 16.4.x-scala2.12
  spark_conf:
    spark.databricks.cluster.profile: singleNode
    spark.master: "local[*]"
    spark.databricks.optimizer.collapseWindows.enabled: "false"
  node_type_id: Standard_E4ds_v4
  num_workers: 2
  azure_attributes:
    availability: ON_DEMAND_AZURE
    first_on_demand: 1
    spot_bid_max_price: -1
  # spark_env_vars:
  #   ...
resources:
  jobs:
    fleet_wtg_ge_silver:
      job_clusters:
        - job_cluster_key: job_cluster
          new_cluster:
            <<: *cluster_base
    fleet_wtg_ge_curated:
      job_clusters:
        - job_cluster_key: job_cluster
          new_cluster:
            <<: *cluster_base
    fleet_wtg_sgre_silver:
      job_clusters:
        - job_cluster_key: job_cluster
          new_cluster:
            <<: *cluster_base
    fleet_wtg_sgre_curated:
      job_clusters:
        - job_cluster_key: job_cluster
          new_cluster:
            <<: *cluster_base
    fleet_wtg_vestas_silver:
      job_clusters:
        - job_cluster_key: job_cluster
          new_cluster:
            <<: *cluster_base
    fleet_wtg_vestas_curated:
      job_clusters:
        - job_cluster_key: job_cluster
          new_cluster:
            <<: *cluster_base

But when I tried running the deployment, I got the following error:
Error: multiple resources have been defined with the same key: fleet_wtg_sgre_curated
  at jobs.fleet_wtg_sgre_curated
  in cluster_definitions.yml:59:7
     fleet_wtg_sgre_curated.yml:4:7
Error: multiple resources have been defined with the same key: feature_job_compute_cluster_fleet_wtg_ge_silver
  at jobs.feature_job_compute_cluster_fleet_wtg_ge_silver
  in fleet_wtg_ge_silver-v2.yml:4:7
     fleet_wtg_ge_silver-v3.yml:4:7
Error: multiple resources have been defined with the same key: fleet_wtg_vestas_silver
  at jobs.fleet_wtg_vestas_silver
  in cluster_definitions.yml:64:7
     fleet_wtg_vestas_silver.yml:4:7
Error: multiple resources have been defined with the same key: fleet_wtg_vestas_curated
  at jobs.fleet_wtg_vestas_curated
  in cluster_definitions.yml:69:7
     fleet_wtg_vestas_curated.yml:4:7
Error: multiple resources have been defined with the same key: fleet_wtg_ge_silver
  at jobs.fleet_wtg_ge_silver
  in cluster_definitions.yml:44:7
     fleet_wtg_ge_silver.yml:4:7
Error: multiple resources have been defined with the same key: fleet_wtg_ge_curated
  at jobs.fleet_wtg_ge_curated
  in cluster_definitions.yml:49:7
     fleet_wtg_ge_curated.yml:4:7

Would appreciate some help on this one!
2 weeks ago
Hello @ChristianRRL
I suspect the problem is in cluster_definitions.yml: it not only defines a reusable cluster profile, it also redefines the same jobs that already exist in the individual fleet_*.yml files.
Why? Because in Databricks Asset Bundles, each entry under:
resources:
  jobs:
    <job_key>:
must be unique in the final resolved bundle.
So if fleet_wtg_ge_silver exists in fleet_wtg_ge_silver.yml and also in cluster_definitions.yml, the bundle sees two resources with the same key and fails.
I tried to replicate your issue and got the same error.
Databricks Asset Bundles support splitting the bundle configuration across multiple YAML files using include, but the included files are combined into one bundle configuration, so you cannot define the same top-level job resource twice.
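To make the collision concrete, here is a minimal sketch of the two definitions as the bundle sees them once the included files have been combined. The file paths and the trimmed-down fields are assumptions for illustration; only the job key and the cluster values come from your post.

```yaml
# resources/common/cluster_definitions.yml (assumed path)
resources:
  jobs:
    fleet_wtg_ge_silver:              # first definition of this job key
      job_clusters:
        - job_cluster_key: job_cluster
          new_cluster:
            spark_version: 16.4.x-scala2.12
            node_type_id: Standard_E4ds_v4
            num_workers: 2
---
# resources/jobs/fleet_wtg_ge_silver.yml (assumed path)
resources:
  jobs:
    fleet_wtg_ge_silver:              # same key again, reported as a duplicate
      name: fleet_wtg_ge_silver
      tasks:
        - task_key: Silver
          job_cluster_key: job_cluster
          # notebook_task trimmed for brevity
```

The two mappings are not merged key by key. The second fleet_wtg_ge_silver is treated as another resource with the same key, which is exactly what the error output reports.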
A better approach is to define the cluster as a complex variable and reference it from each job.

# cluster_definitions.yml
variables:
  fleet_job_cluster:
    description: Shared fleet job cluster definition
    type: complex
    default:
      spark_version: 16.4.x-scala2.12
      node_type_id: Standard_E4ds_v4
      num_workers: 2
      azure_attributes:
        availability: ON_DEMAND_AZURE
        first_on_demand: 1
        spot_bid_max_price: -1
      spark_conf:
        spark.databricks.cluster.profile: singleNode
        spark.master: "local[*]"
        spark.databricks.optimizer.collapseWindows.enabled: "false"

Then in each job file:

resources:
  jobs:
    fleet_wtg_ge_silver:
      name: fleet_wtg_ge_silver
      job_clusters:
        - job_cluster_key: job_cluster
          new_cluster: ${var.fleet_job_cluster}
      tasks:
        - task_key: Silver
          notebook_task:
            notebook_path: ../src/fleet/wtg_ge/silver/Silver.py
            base_parameters:
              task_name: "{{task.name}}"
            source: WORKSPACE
          job_cluster_key: job_cluster

Alternatively, you can use YAML anchors, but anchors are only practical when the anchor and its usages are in the same YAML document, so they are not a good cross-file reuse mechanism for this case.
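For comparison, this is roughly what the anchor approach looks like when it does work, that is, when the anchor and every alias sit in the same file. It is only a sketch: the file name is hypothetical, the top-level key x-cluster-base is borrowed from the original post (the CLI may flag it as an unknown field), and the num_workers override is an illustrative value.

```yaml
# resources/jobs/fleet_wtg_ge.yml (hypothetical file holding both GE jobs)
# The anchor is defined in the same YAML document as the aliases that use it.
x-cluster-base: &cluster_base
  spark_version: 16.4.x-scala2.12
  node_type_id: Standard_E4ds_v4
  num_workers: 2

resources:
  jobs:
    fleet_wtg_ge_silver:
      name: fleet_wtg_ge_silver
      job_clusters:
        - job_cluster_key: job_cluster
          new_cluster:
            <<: *cluster_base         # copies every key from the anchor
      # tasks omitted for brevity
    fleet_wtg_ge_curated:
      name: fleet_wtg_ge_curated
      job_clusters:
        - job_cluster_key: job_cluster
          new_cluster:
            <<: *cluster_base
            num_workers: 4            # a key set here wins over the merged one
      # tasks omitted for brevity
```

Keep in mind that the merge key is shallow: overriding a nested mapping such as spark_conf replaces it as a whole instead of merging its entries. An alias also cannot point at an anchor in another file, which is why this does not help across your fleet_*.yml files.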
Also, this part caught my eye:

fleet_wtg_ge_silver-v2.yml
fleet_wtg_ge_silver-v3.yml

The bundle include pattern is picking up multiple versions of the same job, so try to tighten the include pattern or move old test versions outside the included folder:

include:
  - resources/jobs/*.yml
  - resources/common/*.yml

and avoid including archived files such as *-v2.yml and *-v3.yml.
So you can use a structure like:

databricks.yml
resources/
  common/
    cluster_definitions.yml
  jobs/
    fleet_wtg_ge_silver.yml
    fleet_wtg_ge_curated.yml
    fleet_wtg_sgre_silver.yml

with:

# databricks.yml
include:
  - resources/common/*.yml
  - resources/jobs/*.yml
2 weeks ago
Hi everyone, quick comment on my Friday post for relevance. I would appreciate any help on this case.
Thanks!
2 weeks ago
Using a complex variable is the right suggestion. Thank you!