Databricks Community

ashdam · ‎11-02-2023

Here is my bundle definition

Spoiler

# This is a Databricks asset bundle definition for my_project.

# See https://docs.databricks.com/dev-tools/bundles/index.html for documentation.

experimental:

python_wheel_wrapper: true

bundle:

include:

- resources/*.yml

targets:

# The 'dev' target, used for development purposes.

# Whenever a developer deploys using 'dev', they get their own copy.

dev:

# We use 'mode: development' to make sure everything deployed to this target gets a prefix

# like '[dev my_user_name]'. Setting this mode also disables any schedules and

# automatic triggers for jobs and enables the 'development' mode for Delta Live Tables pipelines.

mode: development

default: true

compute_id: xxxxx-yyyyyyyy-zzzzzzz

workspace:

host: https://adb-1033997625577373.13.azuredatabricks.net

# Optionally, there could be a 'staging' target here.

# (See Databricks docs on CI/CD at https://docs.databricks.com/dev-tools/bundles/index.html.)

#

# staging:

# workspace:

# host: https://adb-1033997625577373.13.azuredatabricks.net

# The 'prod' target, used for production deployment.

prod:

# For production deployments, we only have a single copy, so we override the

# workspace.root_path default of

# /Users/${workspace.current_user.userName}/.bundle/${bundle.target}/${bundle.name}

# to a path that is not specific to the current user.

mode: production

workspace:

host: https://adb-1033997625577373.13.azuredatabricks.net

root_path: /Shared/.bundle/prod/${bundle.name}

run_as:

# This runs as gonzalomoran@ppg.com in production. Alternatively,

# a service principal could be used here using service_principal_name

# (see Databricks documentation).

user_name: gonzalomoran@ppg.com

# This is a Databricks asset bundle definition for my_project.# See https://docs.databricks.com/dev-tools/bundles/index.html for documentation.experimental: python_wheel_wrapper: truebundle: name: my_projectinclude: - resources/*.ymltargets: # The 'dev' target, used for development purposes. # Whenever a developer deploys using 'dev', they get their own copy. dev: # We use 'mode: development' to make sure everything deployed to this target gets a prefix # like '[dev my_user_name]'. Setting this mode also disables any schedules and # automatic triggers for jobs and enables the 'development' mode for Delta Live Tables pipelines. mode: development default: true compute_id: xxxxx-yyyyyyyy-zzzzzzz workspace: host: https://adb-1033997625577373.13.azuredatabricks.net # Optionally, there could be a 'staging' target here. # (See Databricks docs on CI/CD at https://docs.databricks.com/dev-tools/bundles/index.html.) # # staging: # workspace: # host: https://adb-1033997625577373.13.azuredatabricks.net # The 'prod' target, used for production deployment. prod: # For production deployments, we only have a single copy, so we override the # workspace.root_path default of # /Users/${workspace.current_user.userName}/.bundle/${bundle.target}/${bundle.name} # to a path that is not specific to the current user. mode: production workspace: host: https://adb-1033997625577373.13.azuredatabricks.net root_path: /Shared/.bundle/prod/${bundle.name} run_as: # This runs as gonzalomoran@ppg.com in production. Alternatively, # a service principal could be used here using service_principal_name # (see Databricks documentation). user_name: gonzalomoran@ppg.com

My user has no rights to create new cluster but job definition tries to create a new one

Spoiler

# The main job for my_project

resources:

jobs:

my_project_job:

schedule:

quartz_cron_expression: '44 37 8 * * ?'

timezone_id: Europe/Amsterdam

email_notifications:

on_failure:

- gonzalomoran@ppg.com

tasks:

- task_key: notebook_task

job_cluster_key: job_cluster

notebook_task:

notebook_path: ../src/notebook.ipynb

- task_key: refresh_pipeline

depends_on:

- task_key: notebook_task

pipeline_task:

pipeline_id: ${resources.pipelines.my_project_pipeline.id}

- task_key: main_task

depends_on:

- task_key: refresh_pipeline

job_cluster_key: job_cluster

python_wheel_task:

package_name: my_project

entry_point: main

libraries:

# By default we just include the .whl file generated for the my_project package.

# See https://docs.databricks.com/dev-tools/bundles/library-dependencies.html

# for more information on how to add other libraries.

- whl: ../dist/*.whl

job_clusters:

- job_cluster_key: job_cluster

new_cluster:

spark_version: 13.3.x-scala2.12

node_type_id: Standard_D3_v2

autoscale:

min_workers: 1

max_workers: 4

# The main job for my_projectresources: jobs: my_project_job: name: my_project_job schedule: quartz_cron_expression: '44 37 8 * * ?' timezone_id: Europe/Amsterdam email_notifications: on_failure: - gonzalomoran@ppg.com tasks: - task_key: notebook_task job_cluster_key: job_cluster notebook_task: notebook_path: ../src/notebook.ipynb - task_key: refresh_pipeline depends_on: - task_key: notebook_task pipeline_task: pipeline_id: ${resources.pipelines.my_project_pipeline.id} - task_key: main_task depends_on: - task_key: refresh_pipeline job_cluster_key: job_cluster python_wheel_task: package_name: my_project entry_point: main libraries: # By default we just include the .whl file generated for the my_project package. # See https://docs.databricks.com/dev-tools/bundles/library-dependencies.html # for more information on how to add other libraries. - whl: ../dist/*.whl job_clusters: - job_cluster_key: job_cluster new_cluster: spark_version: 13.3.x-scala2.12 node_type_id: Standard_D3_v2 autoscale: min_workers: 1 max_workers: 4

I tried to remove the "job_clusters" lines but its complains about its missing. The other option is using "exiting_cumputing" within the job but this would conflic when I wanted to run the same job in production with another cluster.

Do you know how to use the defined cluster per defined target?

Regards

SvenG · ‎12-20-2023

Hi @Retired_mod,

Is it possible that you provide a minimal working example for option 1 or option 3?
I currently have a test jop:

"""

resources:

jobs:

my_project_job: #my_project_job_${bundle.target}

schedule:

quartz_cron_expression: '44 37 8 * * ?'

timezone_id: Europe/Amsterdam

tasks:

- task_key: notebook_task

existing_cluster_id: ${var.my_existing_cluster}

notebook_task:

notebook_path: ../src/notebook_${bundle.target}_test.ipynb

"""

with

"""

variables:

my_existing_cluster:

desciption: Id of my existing Cluster

default: 12345_my_id

"""

and i want to use a different cluster in prod and dev, however, the job that should be executed should remain the same.
Any ideas how i can solve this issue?