Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Suggestion: allow existing_cluster_id override from run_job_task and for_each_task

kenmyers-8451
Contributor

I'm not sure if this feature exists in newer versions of the Databricks CLI (doubtful, since it doesn't seem possible in the UI either; for what it's worth, my team has stayed on 0.222.0 for a while because it has been stable enough for us), and maybe this is a niche use case, but perhaps others would agree with the idea. The short of it is that we'd like to be able to override the existing_cluster_id in a workflow when it is called from run_job_task or for_each_task. My first thought was that maybe this could be done with a job parameter, like the following:

      tasks:
        - task_key: merge_serverless
          existing_cluster_id: ${job.parameters.cluster_override}
          [... task details ...]
      parameters:
        - name: cluster_override
          default: ${var.all_purpose_UC_cluster_ID}

But this doesn't work; I get an error that says:

5517:             "existing_cluster_id": "${job.parameters.cluster_override}",

A managed resource "job" "parameters" has not been declared in the root
module.

Note: I've also tried "{{job.parameters.cluster_override}}" and got a similar error.

So it seems like it tries to set up the cluster details before the job parameters have been resolved. Is there another way to do something like this, or could this be put on the roadmap as something to support in the future?
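For what it's worth, deploy-time substitution with bundle variables (like the ${var.all_purpose_UC_cluster_ID} above) does work on fields like existing_cluster_id, so one partial workaround could be to override the variable per deployment target. That only picks the cluster per deployment, though, not per invocation from run_job_task, and it would mean each component deploys its own copy of the generic workflow. A rough sketch, with placeholder names and cluster IDs:

    variables:
      shared_cluster_id:
        description: Cluster used by the generic workflow, overridden per target
        default: "0000-000000-placeholder"       # placeholder ID

    targets:
      component_1:
        variables:
          shared_cluster_id: "0101-000000-comp1"   # placeholder ID
      component_2:
        variables:
          shared_cluster_id: "0202-000000-comp2"   # placeholder ID

    resources:
      jobs:
        generic_workflow:
          tasks:
            - task_key: merge_serverless
              existing_cluster_id: ${var.shared_cluster_id}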

Here is some more background on why we are looking to do this: my team's project has two main components that come together to make one whole. The two components run on separate clusters in order to 1) separate the work and 2) track the costs of each part independently. However, the two components share some workflows, because those workflows were written to be generic enough to be used anywhere, and they are kicked off by things like run_job_task and for_each_task. The problem is that those shared workflows are hardcoded to a single existing_cluster_id, which causes two issues:

1. We are not able to separate the costs between the two components.

2. We have to wait for clusters to start up. For example, the generic workflow is hardcoded to use component 1's clusters. When we kick it off from component 2, which is already running on its own clusters, the pipeline still waits for component 1's clusters to start. Our pipelines also have gaps long enough that those component 1 clusters start up, shut down, and then start up again, which adds maybe 20 minutes to the runtime.

So if we could pass a cluster override ID to everything run by run_job_task and for_each_task, we could tell those jobs to use the clusters that are already hot, save some time, and track the costs correctly.
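To make the suggestion concrete, here is a rough sketch of what the calling side could look like if run_job_task (and similarly for_each_task) let a job parameter flow into the child job's existing_cluster_id. The cluster_override parameter and its resolution into the cluster field are the hypothetical part; the job names and variable below are placeholders:

    resources:
      jobs:
        component_2_pipeline:
          tasks:
            - task_key: run_shared_merge
              run_job_task:
                job_id: ${resources.jobs.generic_workflow.id}
                job_parameters:
                  # hypothetical: the child job would resolve this into its
                  # existing_cluster_id instead of the hardcoded cluster
                  cluster_override: ${var.component_2_cluster_id}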

 

1 REPLY

WiliamRosa
New Contributor III

Hi @kenmyers-8451, I’m also not sure if this feature exists in newer versions. Since no one else has replied, I’d suggest raising a ticket with the Databricks Support Team — they’ll be able to provide clarity on this topic:

http://help.databricks.com/s/contact-us?ReqType=training

Wiliam Rosa
Data Engineer | Machine Learning Engineer
LinkedIn: linkedin.com/in/wiliamrosa
