Variable Compute clusters within a Job

allyallen — Fri, 11 Jul 2025 14:48:03 GMT

We have 3 possible compute clusters that we can run a notebook against.
They are varying sizes and the one that the notebook uses will depend on the size of the data being processed.

We "t-shirt size" each tenant based on their data size (S, M, L) and can read this config in from Postgres in a notebook.
Once we know the t-shirt size, is there a way of setting the compute cluster dynamically in subsequent tasks?
e.g. a tenant is size M so the rest of the tasks in the job run on the M cluster

We'd like to avoid duplicating jobs/tasks!
Thanks in advance

Re: Variable Compute clusters within a Job

eniwoke — Fri, 11 Jul 2025 20:11:52 GMT

Hi @allyallen, just to clarify your use case to see if I can provide a solution:

Are you saying you have a single job with multiple tasks, and each of those tasks runs the same notebook (e.g., notebook_1), but you'd like the compute cluster to vary depending on the tenant's t-shirt size (S, M, L) determined within the notebook and a task?

Or is it more that you have a parent job (e.g., job_1) which dynamically triggers other jobs or notebooks, and you'd like each of those to run on the appropriate cluster based on the tenant’s size?

Re: Variable Compute clusters within a Job

allyallen — Tue, 15 Jul 2025 07:13:38 GMT

Hi @eniwoke

Thank you for replying!
I have one job that has a string of other jobs and notebooks as tasks. This job is designed to be run against different tenants as a way of ingesting data.

NB1 at the beginning of the job determines the t-shirt size for the tenant and if it's S, all subsequent tasks and jobs need to run on the S cluster. If NB1 finds the t-shirt size is M, all following tasks and jobs will run on the M cluster.
At the moment, I can only set one cluster per task and can't see a way of dynamically setting the cluster to use based on the output of a previous task.
Hope this clarifies the ask a little bit!

Thanks!

Re: Variable Compute clusters within a Job

eniwoke — Tue, 15 Jul 2025 17:14:02 GMT

Hi @allyallen, thanks for the explanation. Yes, you are right; there is no direct way to change the cluster for a task while within the same job. However, you can still achieve a somewhat similar result by making a few tweaks.

You can start by separating the job into separate jobs, say job_1 and job_2. The task that runs NB1 will be in job_1, and then the other tasks can be in job_2.

Since you already know the job name/id for job_2, you can use the update job settings API call to update the cluster for the job. The downside is that you'll need to know the job_id beforehand, and you'd be using NB1 to update job_2's cluster. That's one approach.
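A minimal sketch of this first approach, assuming the Jobs API 2.1 `jobs/update` endpoint and tasks that run on existing clusters through `existing_cluster_id`. The helper names, host, token, job_id, and cluster IDs are all placeholders for illustration and not something from your workspace.

```python
import copy
import json
import urllib.request


def retarget_tasks(tasks, cluster_id):
    """Return a copy of the task list with cluster-bound tasks pointed at cluster_id."""
    updated = copy.deepcopy(tasks)
    for task in updated:
        # Only tasks that already reference an existing cluster are changed;
        # other task types (e.g. tasks that trigger another job) are left alone.
        if "existing_cluster_id" in task:
            task["existing_cluster_id"] = cluster_id
    return updated


def build_update_payload(job_id, tasks, cluster_id):
    """Build the request body for POST /api/2.1/jobs/update."""
    return {
        "job_id": job_id,
        "new_settings": {"tasks": retarget_tasks(tasks, cluster_id)},
    }


def update_job_cluster(host, token, job_id, tasks, cluster_id):
    """Send the update; host and token are assumed to come from your own config."""
    payload = build_update_payload(job_id, tasks, cluster_id)
    request = urllib.request.Request(
        f"{host}/api/2.1/jobs/update",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

The current task list for job_2 can be read from the job's settings first (for example with `jobs/get`) and passed in as `tasks`. As far as I know, tasks that trigger other jobs use the compute defined in those child jobs, so each child job would likely need the same update.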

Another approach is to create job_2 programmatically in NB1 every time the t-shirt size changes.
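A rough sketch of what the request body for that could look like, assuming the Jobs API 2.1 `jobs/create` endpoint and a simple chain of notebook tasks. The function name, the `step_N` task keys, and the notebook paths are made up for the example.

```python
def build_create_payload(name, notebook_paths, cluster_id):
    """Build a body for POST /api/2.1/jobs/create with tasks chained in order."""
    tasks = []
    previous_key = None
    for index, path in enumerate(notebook_paths):
        task = {
            "task_key": f"step_{index}",  # hypothetical naming scheme
            "notebook_task": {"notebook_path": path},
            "existing_cluster_id": cluster_id,
        }
        if previous_key is not None:
            # Run each notebook only after the previous one has finished.
            task["depends_on"] = [{"task_key": previous_key}]
        tasks.append(task)
        previous_key = task["task_key"]
    return {"name": name, "tasks": tasks}
```

For example, `build_create_payload("job_2_M", ["/Shared/nb2", "/Shared/nb3"], "cluster-id-medium")` gives a two-task job bound to the M cluster. The trade-off compared with updating an existing job is that you end up managing more job definitions.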

Let me know if it helps 🙂

Re: Variable Compute clusters within a Job

allyallen — Wed, 16 Jul 2025 14:10:36 GMT

Hi @eniwoke

That's a great solution, thank you so much!
Our process is now as follows:
NB1 gets the tenant t-shirt size and sets the cluster_id for each size as a variable.
The notebook then loops through each tenant and, using the Databricks API, updates the tasks within the job to the right cluster_id and triggers a run of the main job.

After testing (with one tenant as S and one as M), the right job was triggered twice (once for each tenant) and each of those runs ran on the right sized cluster for the tenant in question.
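For anyone finding this later, the loop looks roughly like the sketch below. It is simplified and not our exact code: the cluster IDs are placeholders, `post` stands in for whatever function sends an authenticated request to the Databricks REST API, and passing the tenant through `job_parameters` assumes the job defines a job-level parameter called `tenant`.

```python
# Placeholder cluster IDs, one per t-shirt size.
CLUSTER_BY_SIZE = {
    "S": "cluster-id-small",
    "M": "cluster-id-medium",
    "L": "cluster-id-large",
}


def run_for_tenants(tenant_sizes, job_id, tasks, post):
    """For each tenant, point the job's tasks at the right cluster, then trigger a run.

    tenant_sizes: dict of tenant name -> "S" / "M" / "L"
    tasks:        the job's current task settings (list of dicts)
    post:         callable(path, payload) that performs the API request
    """
    for tenant, size in tenant_sizes.items():
        cluster_id = CLUSTER_BY_SIZE[size]
        new_tasks = [
            {**task, "existing_cluster_id": cluster_id}
            if "existing_cluster_id" in task
            else dict(task)
            for task in tasks
        ]
        # Update the job so its tasks use this tenant's cluster...
        post(
            "/api/2.1/jobs/update",
            {"job_id": job_id, "new_settings": {"tasks": new_tasks}},
        )
        # ...then trigger a run for this tenant.
        post(
            "/api/2.1/jobs/run-now",
            {"job_id": job_id, "job_parameters": {"tenant": tenant}},
        )
```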

It's just what we were after, thank you so so much for your help!
Ally

Re: Variable Compute clusters within a Job

eniwoke — Wed, 16 Jul 2025 15:33:02 GMT

Fantastic, I'm glad to hear it worked! 🙂