Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to fall back and rerun the entire job in case of a cluster failure?

jeremy98
Honored Contributor

Hi community,

My team and I are running a job that is triggered by dynamic scheduling, with the schedule updated from within some of the job's tasks. However, this job is attached to a cluster that is always running and never terminated.

I understand that keeping a cluster constantly running might not be a best practice in Databricks, but this is currently how our setup works.

In the event of a failure—such as the cluster crashing due to excessive driver memory usage—what would be the recommended way to automatically restart the cluster and resume the related job? Also, is there a way to catch such errors programmatically to trigger a restart?

Thanks in advance!

4 REPLIES

RiyazAliM
Honored Contributor

Hi @jeremy98,

Fundamentally, changing a few design patterns would help you save on cluster costs and avoid job failures caused by cluster crashes.

I understand that you're using an always-running cluster for your workflow. I'm not sure about the use case, but I'd suggest replacing the always-running cluster with a job cluster, which only spins up when your job starts and shuts down when it's done. Read more here: https://docs.databricks.com/aws/en/compute

To answer your other concern about cluster crashes: Databricks lets you autoscale your clusters. Set a min and max number of workers, and more workers will be added when there's a heavy load. This saves you the trouble of resuming failed jobs and handling cluster crashes.
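To illustrate those two points, here's a minimal sketch of what a job-cluster definition with autoscaling might look like in a Jobs API 2.1 job spec (the job name, notebook path, node type, Spark version, and worker counts below are placeholder assumptions, not your actual setup):

```python
# Minimal sketch (all names and values below are placeholders).
# A job cluster is declared under "job_clusters" in the job settings: it is
# created when the run starts and terminated when the run finishes, and
# "autoscale" lets Databricks add workers under heavy load instead of the
# run failing on a fixed-size cluster.
job_settings = {
    "name": "client-email-export",
    "job_clusters": [
        {
            "job_cluster_key": "export_cluster",
            "new_cluster": {
                "spark_version": "15.4.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "autoscale": {"min_workers": 1, "max_workers": 4},
            },
        }
    ],
    "tasks": [
        {
            "task_key": "send_emails",
            "job_cluster_key": "export_cluster",
            "notebook_task": {"notebook_path": "/Workspace/exports/send_emails"},
        }
    ],
}
```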

In the workflow UI, there's also an option to set retries on the tasks; if any task fails, it will be restarted after an interval of your choice.
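In API terms, the same retry settings on a task might look roughly like this (values are illustrative only):

```python
# Sketch of task-level retries (illustrative values). "max_retries" and
# "min_retry_interval_millis" are the Jobs API fields behind the Retries
# option in the workflow UI; a max_retries of -1 means retry indefinitely.
task_with_retries = {
    "task_key": "send_emails",
    "job_cluster_key": "export_cluster",
    "notebook_task": {"notebook_path": "/Workspace/exports/send_emails"},
    "max_retries": 2,
    "min_retry_interval_millis": 5 * 60 * 1000,  # wait 5 minutes between attempts
    "retry_on_timeout": False,
}
```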

Riz

jeremy98
Honored Contributor

Hi,

Thanks for your answer! We're developing a job that sends emails to our clients, but these exports need to be sent at the time each client chooses. For example, if a client wants to receive an email every day, the job needs to be rescheduled each time (we already manage this dynamic scheduling with API calls that change the schedule on every run) and then executed immediately.

So we don't want to lose time waiting 5-7 minutes for a job compute to come up.
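For context, a rescheduling call like the ones mentioned above might look roughly like this (the workspace URL, token, job ID, and cron expression are placeholders for illustration, not our real values):

```python
import requests

# Rough sketch of a Jobs API 2.1 "update" call that changes only the schedule.
# HOST, TOKEN, job_id, and the cron expression are placeholders.
HOST = "https://<workspace-url>"
TOKEN = "<personal-access-token>"

def reschedule_job(job_id: int, cron: str, timezone: str = "UTC") -> None:
    """Partially update an existing job so that only its schedule changes."""
    resp = requests.post(
        f"{HOST}/api/2.1/jobs/update",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={
            "job_id": job_id,
            "new_settings": {
                "schedule": {
                    "quartz_cron_expression": cron,
                    "timezone_id": timezone,
                    "pause_status": "UNPAUSED",
                }
            },
        },
    )
    resp.raise_for_status()

# e.g. a client asks for a daily 08:00 email, so the job reschedules itself:
# reschedule_job(job_id=123456, cron="0 0 8 * * ?", timezone="Europe/Rome")
```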

RiyazAliM
Honored Contributor

Hey @jeremy98 

Have you had a chance to experiment with the Databricks serverless offering? Serverless spin-up times are typically around ~1 minute, and it has built-in autoscaling based on the workload, which seems like a good fit for your use case. Check out more info at the link below:

https://docs.databricks.com/aws/en/jobs/run-serverless-jobs

Also, to answer your initial question: `what would be the recommended way to automatically restart the cluster and resume the related job? Also, is there a way to catch such errors programmatically to trigger a restart?`

- After triggering your job, wait x minutes (the time it takes to finish), then use the Jobs API to check whether the run succeeded or not.

[Screenshot: Jobs API run output showing the run's result_state]

If the result_state is FAILED, trigger the job again; make sure your interactive cluster is running when you submit the run-now request.
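Putting those steps together, a rough sketch of the check-and-retrigger logic could look like the snippet below (the workspace URL, token, job ID, and cluster ID are placeholders; it looks at the latest completed run, starts the interactive cluster if it's terminated, and re-runs the job on a FAILED result):

```python
import requests

# Sketch only: HOST, TOKEN, and the IDs are placeholders. Checks the latest
# completed run of the job and re-triggers it if it failed, starting the
# interactive cluster first if it was terminated.
HOST = "https://<workspace-url>"
TOKEN = "<personal-access-token>"
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

def last_run_result(job_id: int):
    """Return the result_state (e.g. SUCCESS, FAILED) of the job's latest completed run."""
    resp = requests.get(
        f"{HOST}/api/2.1/jobs/runs/list",
        headers=HEADERS,
        params={"job_id": job_id, "limit": 1, "completed_only": "true"},
    )
    resp.raise_for_status()
    runs = resp.json().get("runs", [])
    return runs[0]["state"].get("result_state") if runs else None

def ensure_cluster_running(cluster_id: str) -> None:
    """Start the interactive cluster if it is terminated; no-op if already running."""
    state = requests.get(
        f"{HOST}/api/2.0/clusters/get",
        headers=HEADERS,
        params={"cluster_id": cluster_id},
    ).json()["state"]
    if state == "TERMINATED":
        requests.post(
            f"{HOST}/api/2.0/clusters/start",
            headers=HEADERS,
            json={"cluster_id": cluster_id},
        ).raise_for_status()

def retrigger_if_failed(job_id: int, cluster_id: str) -> None:
    """Re-run the job on its interactive cluster if the last run failed."""
    if last_run_result(job_id) == "FAILED":
        ensure_cluster_running(cluster_id)
        requests.post(
            f"{HOST}/api/2.1/jobs/run-now",
            headers=HEADERS,
            json={"job_id": job_id},
        ).raise_for_status()
```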

Let me know if you have any questions.

Riz

jeremy98
Honored Contributor

Hi,

Thanks for your answer, but I was talking about a different issue. In our case, serverless compute takes a few minutes to install the packages, and that's not good since our job is made up of different tasks; if they run on serverless, each one takes a few minutes to install every time. For this reason we wanted to use a cluster that is always up.