Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to fall back and rerun the entire job in case of a cluster failure?

jeremy98
Honored Contributor

Hi community,

My team and I are running a job that is triggered by dynamic scheduling, with the schedule updated from within some of the job's tasks. However, this job is attached to a cluster that is always running and never terminated.

I understand that keeping a cluster constantly running might not be a best practice in Databricks, but this is currently how our setup works.

In the event of a failure—such as the cluster crashing due to excessive driver memory usage—what would be the recommended way to automatically restart the cluster and resume the related job? Also, is there a way to catch such errors programmatically to trigger a restart?

Thanks in advance!

4 REPLIES

RiyazAliM
Honored Contributor

Hi @jeremy98,

Fundamentally, changing a few design patterns would help you save on cluster costs and avoid job failures caused by cluster crashes.

I understand that you're using an always-running cluster for your workflow. I'm not sure about the use case, but I'd suggest replacing the always-running cluster with a job cluster, which only spins up when your job starts and shuts down when it's done. Read more here: https://docs.databricks.com/aws/en/compute

To answer your other concern about cluster crashes: Databricks lets you autoscale your clusters. Set a min and max number of workers, and more workers will be added when there's a heavy load. This saves you the trouble of resuming failed jobs and handling cluster crashes.
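To illustrate those two points, here's a minimal sketch of what a job-cluster definition with autoscaling might look like in a Jobs API 2.1 job spec (the job name, notebook path, node type, Spark version, and worker counts below are placeholder assumptions, not your actual setup):

```python
# Minimal sketch (all names and values below are placeholders).
# A job cluster is declared under "job_clusters" in the job settings: it is
# created when the run starts and terminated when the run finishes, and
# "autoscale" lets Databricks add workers under heavy load instead of the
# run failing on a fixed-size cluster.
job_settings = {
    "name": "client-email-export",
    "job_clusters": [
        {
            "job_cluster_key": "export_cluster",
            "new_cluster": {
                "spark_version": "15.4.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "autoscale": {"min_workers": 1, "max_workers": 4},
            },
        }
    ],
    "tasks": [
        {
            "task_key": "send_emails",
            "job_cluster_key": "export_cluster",
            "notebook_task": {"notebook_path": "/Workspace/exports/send_emails"},
        }
    ],
}
```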

In the workflow UI, there's also an option to set retries on the tasks; if any task fails, it will be restarted after an interval of your choice.
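In API terms, the same retry settings on a task might look roughly like this (values are illustrative only):

```python
# Sketch of task-level retries (illustrative values). "max_retries" and
# "min_retry_interval_millis" are the Jobs API fields behind the Retries
# option in the workflow UI; a max_retries of -1 means retry indefinitely.
task_with_retries = {
    "task_key": "send_emails",
    "job_cluster_key": "export_cluster",
    "notebook_task": {"notebook_path": "/Workspace/exports/send_emails"},
    "max_retries": 2,
    "min_retry_interval_millis": 5 * 60 * 1000,  # wait 5 minutes between attempts
    "retry_on_timeout": False,
}
```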

Riz

jeremy98
Honored Contributor

Hi,

Thanks for your answer! We're developing a job that sends emails to our clients, but these exports need to be sent at the time each client chooses. For example, if a client wants to receive an email every day, the job needs to be rescheduled each time (we already manage this dynamic scheduling with API calls that change the schedule on every run) and then executed immediately.

So we don't want to lose time waiting 5-7 minutes for a job compute to come up.
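For context, a rescheduling call like the ones mentioned above might look roughly like this (the workspace URL, token, job ID, and cron expression are placeholders for illustration, not our real values):

```python
import requests

# Rough sketch of a Jobs API 2.1 "update" call that changes only the schedule.
# HOST, TOKEN, job_id, and the cron expression are placeholders.
HOST = "https://<workspace-url>"
TOKEN = "<personal-access-token>"

def reschedule_job(job_id: int, cron: str, timezone: str = "UTC") -> None:
    """Partially update an existing job so that only its schedule changes."""
    resp = requests.post(
        f"{HOST}/api/2.1/jobs/update",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={
            "job_id": job_id,
            "new_settings": {
                "schedule": {
                    "quartz_cron_expression": cron,
                    "timezone_id": timezone,
                    "pause_status": "UNPAUSED",
                }
            },
        },
    )
    resp.raise_for_status()

# e.g. a client asks for a daily 08:00 email, so the job reschedules itself:
# reschedule_job(job_id=123456, cron="0 0 8 * * ?", timezone="Europe/Rome")
```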

RiyazAliM
Honored Contributor

Hey @jeremy98 

Have you had a chance to experiment with the Databricks serverless offering? Serverless spin-up times are typically around ~1 minute, and it has built-in autoscaling based on the workload, which seems like a good fit for your use case. Check out more info at the link below:

https://docs.databricks.com/aws/en/jobs/run-serverless-jobs

Also, to answer your initial question: `what would be the recommended way to automatically restart the cluster and resume the related job? Also, is there a way to catch such errors programmatically to trigger a restart?`

- After triggering your job, wait x minutes (the time it takes to finish), then use the Jobs API to check whether the run succeeded or not.

[Screenshot: Jobs API run output showing the run's result_state]

If the result_state is FAILED, trigger the job again; make sure your interactive cluster is running when you submit the run-now request.
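Putting those steps together, a rough sketch of the check-and-retrigger logic could look like the snippet below (the workspace URL, token, job ID, and cluster ID are placeholders; it looks at the latest completed run, starts the interactive cluster if it's terminated, and re-runs the job on a FAILED result):

```python
import requests

# Sketch only: HOST, TOKEN, and the IDs are placeholders. Checks the latest
# completed run of the job and re-triggers it if it failed, starting the
# interactive cluster first if it was terminated.
HOST = "https://<workspace-url>"
TOKEN = "<personal-access-token>"
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

def last_run_result(job_id: int):
    """Return the result_state (e.g. SUCCESS, FAILED) of the job's latest completed run."""
    resp = requests.get(
        f"{HOST}/api/2.1/jobs/runs/list",
        headers=HEADERS,
        params={"job_id": job_id, "limit": 1, "completed_only": "true"},
    )
    resp.raise_for_status()
    runs = resp.json().get("runs", [])
    return runs[0]["state"].get("result_state") if runs else None

def ensure_cluster_running(cluster_id: str) -> None:
    """Start the interactive cluster if it is terminated; no-op if already running."""
    state = requests.get(
        f"{HOST}/api/2.0/clusters/get",
        headers=HEADERS,
        params={"cluster_id": cluster_id},
    ).json()["state"]
    if state == "TERMINATED":
        requests.post(
            f"{HOST}/api/2.0/clusters/start",
            headers=HEADERS,
            json={"cluster_id": cluster_id},
        ).raise_for_status()

def retrigger_if_failed(job_id: int, cluster_id: str) -> None:
    """Re-run the job on its interactive cluster if the last run failed."""
    if last_run_result(job_id) == "FAILED":
        ensure_cluster_running(cluster_id)
        requests.post(
            f"{HOST}/api/2.1/jobs/run-now",
            headers=HEADERS,
            json={"job_id": job_id},
        ).raise_for_status()
```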

Let me know if you have any questions.

Riz

jeremy98
Honored Contributor

Hi,

Thanks for your answer, but I was talking about a different issue. In our case, serverless compute takes a few minutes to install the packages, and that's not good since our job is made up of different tasks; if they run on serverless, each one takes a few minutes to install every time. For this reason we wanted to use a cluster that is always up.