04-14-2025 01:35 AM
Hey.
I am testing a continuous workflow job which executes the same notebook, so it's rather simple, and it works well. However, it seems to re-create the job cluster for every iteration instead of re-using the one created at the first execution. Is that really the case? If so, is there a setting I am overlooking?
Best,
Johan.
04-14-2025 09:26 PM
Hi jar,
How are you doing today? As per my understanding, you're absolutely right in your observation: Databricks will create a new job cluster for each run of the job, even in a continuous workflow, unless you're using an all-purpose cluster (which isn't ideal for cost or isolation in production). Job clusters are ephemeral by design; they spin up for the run and shut down once it's done, to ensure a clean environment each time. Right now, there's no built-in setting to keep the same job cluster alive across multiple runs in a looped workflow.
If you want to truly reuse a cluster across iterations, you'd need to point your job at an existing all-purpose cluster manually, but that trades off isolation and increases the risk of leftover state between runs. For most use cases, letting the job cluster restart each time is safer, even if it adds some overhead. Let me know if you want to explore workflow alternatives to help minimize startup time!
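For illustration, here is a rough sketch of the two compute options as task settings in the Jobs API 2.1 style. All names, paths, IDs, and sizes below are placeholders, not recommendations:

# Option A: ephemeral job cluster, recreated for every run (the behavior observed here).
task_with_job_cluster = {
    "task_key": "main",
    "notebook_task": {"notebook_path": "/Workspace/Users/me/my_notebook"},  # placeholder
    "new_cluster": {
        "spark_version": "15.4.x-scala2.12",  # example runtime
        "node_type_id": "Standard_DS3_v2",    # example node type
        "num_workers": 2,
    },
}

# Option B: an existing all-purpose cluster, reused across runs
# (faster starts, but weaker isolation and possible leftover state).
task_with_all_purpose_cluster = {
    "task_key": "main",
    "notebook_task": {"notebook_path": "/Workspace/Users/me/my_notebook"},  # placeholder
    "existing_cluster_id": "1234-567890-abcde123",  # placeholder cluster ID
}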
Regards,
Brahma
04-15-2025 03:26 AM
Hi,
@Brahmareddy is right; I've encountered the same issue. Even when using a continuous job, I still experience the overhead of compute restarting after each run completes.
As a temporary workaround (until the more cost-effective serverless update is available), I've created a main notebook that uses dbutils.notebook.run inside a while loop to handle orchestration. This loop runs continuously but breaks every few hours to force a compute restart. Because it's a single-task notebook set up as a continuous job, it immediately kicks off a new run after exiting. Roughly, the driver loop looks like the sketch below; the worker notebook path, timeout, and restart window are placeholders:
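from datetime import datetime, timedelta

restart_at = datetime.now() + timedelta(hours=4)  # force a compute restart every ~4 hours

while datetime.now() < restart_at:
    # Run the worker notebook synchronously; 3600 s is an arbitrary per-run timeout.
    dbutils.notebook.run("/Workspace/Users/me/worker", 3600)

# Ending this run lets the continuous schedule start a fresh run on new compute.
dbutils.notebook.exit(f"Exiting for restart at {datetime.now()}.")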
I've also experimented with compute pools, but they seem to introduce a similar level of overhead.
This setup is far from ideal, but it works for now as we await future improvements from Databricks.
04-15-2025 03:43 AM
"use dbutils.notebook.run inside a while loop to handle orchestration"
04-15-2025 10:41 PM
Thank you all for your answers!
I did use dbutils.notebook.run() inside a while loop at first, but I ultimately ran into OOM errors, even when I tried clearing the cache after each iteration. I'm curious, @RefactorDuncan, if you don't mind explaining: how did you break and restart?
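For context, the per-iteration cleanup I attempted looked roughly like this (worker path and timeout are placeholders):

result = dbutils.notebook.run("/Workspace/Users/me/worker", 3600)
spark.catalog.clearCache()  # drop cached tables/DataFrames between iterations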
04-16-2025 01:26 AM
Hi,
Below is an example code snippet illustrating my current approach. I use dbutils.notebook.exit to terminate the notebook execution either when a predefined stop time is reached or after a set number of iterations in the while loop.
When dbutils.notebook.exit is triggered, the job run stops. Since the job is set on a continuous schedule, a new job run is automatically started immediately afterward.
from datetime import datetime, timedelta

max_job_duration = 14400  # seconds before forcing a job restart
num_max_run = 100         # max loop iterations before forcing a restart (example value)
num_completed_run = 0
time_restart_job = datetime.now() + timedelta(seconds=max_job_duration)

while True:
    time_current = datetime.now()
    if time_current >= time_restart_job or num_completed_run >= num_max_run:
        # Exit the notebook so the continuous schedule can start a fresh run
        dbutils.notebook.exit(f"Exited notebook at {time_current}.")
    # ... one unit of work per iteration goes here ...
    num_completed_run += 1
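For completeness, the piece that makes the restart seamless is the job's continuous trigger. A rough sketch of the relevant job settings in the Jobs API 2.1 style (all names and paths are placeholders):

job_settings = {
    "name": "my-continuous-job",
    "continuous": {"pause_status": "UNPAUSED"},  # a new run starts as soon as one ends
    "tasks": [
        {
            "task_key": "main",
            "notebook_task": {"notebook_path": "/Workspace/Users/me/driver_notebook"},
            "new_cluster": {
                "spark_version": "15.4.x-scala2.12",
                "node_type_id": "Standard_DS3_v2",
                "num_workers": 2,
            },
        }
    ],
}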
04-22-2025 02:11 AM
Clever. Thank you for sharing!