Hey @jeremy98
Have you had a chance to experiment with the Databricks Serverless offering? Ideally, serverless spin-up times are around 1 minute. It has built-in autoscaling based on the workload, so it seems like a good fit for your use case. You can find more info at the link below:
https://docs.databricks.com/aws/en/jobs/run-serverless-jobs
Also to answer your initial question, `what would be the recommended way to automatically restart the cluster and resume the related job? Also, is there a way to catch such errors programmatically to trigger a restart?`
- After triggering your job, wait x minutes (roughly the time the job takes to finish), then use the Jobs API to check whether the run succeeded.
- If the run's `result_state` is `FAILED`, trigger the job again. Make sure your interactive cluster is running when you submit the new run request.
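
The two steps above could be sketched roughly like this. This is only a minimal sketch, not a definitive implementation: the helper names (`call_api`, `needs_retry`, `retry_if_failed`) are made up for illustration, and I'm assuming the REST paths `/api/2.1/jobs/runs/get`, `/api/2.1/jobs/run-now`, `/api/2.1/clusters/get` and `/api/2.1/clusters/start`, so please verify them against the API reference for your workspace. Treating `TIMEDOUT` as retryable is also my assumption.

```python
import json
import urllib.request

# Assumption: these result states are worth retrying; adjust to your needs.
RETRYABLE_RESULTS = {"FAILED", "TIMEDOUT"}


def call_api(host, token, method, path, payload=None):
    """Hypothetical helper: send one authenticated request to the workspace."""
    data = json.dumps(payload).encode() if payload is not None else None
    req = urllib.request.Request(
        f"{host}{path}",
        data=data,
        method=method,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read() or b"{}")


def needs_retry(run):
    """True only if the run has finished and ended in a retryable state."""
    state = run.get("state", {})
    if state.get("life_cycle_state") != "TERMINATED":
        return False  # still pending/running, or not in a normal end state
    return state.get("result_state") in RETRYABLE_RESULTS


def retry_if_failed(api, run_id, job_id, cluster_id):
    """Check a run; if it failed, start the cluster if needed and re-run.

    `api` is any callable taking (method, path, payload) and returning a dict.
    Returns the new run_id, or None when no retry was triggered.
    """
    run = api("GET", f"/api/2.1/jobs/runs/get?run_id={run_id}", None)
    if not needs_retry(run):
        return None

    cluster = api("GET", f"/api/2.1/clusters/get?cluster_id={cluster_id}", None)
    if cluster.get("state") == "TERMINATED":
        # Polling until the cluster reports RUNNING is left out of this sketch.
        api("POST", "/api/2.1/clusters/start", {"cluster_id": cluster_id})

    new_run = api("POST", "/api/2.1/jobs/run-now", {"job_id": job_id})
    return new_run["run_id"]
```

To use it, bind your workspace URL and token once, e.g. `api = lambda m, p, b: call_api(host, token, m, p, b)`, then call `retry_if_failed(api, run_id, job_id, cluster_id)` after the x-minute delay. In a real setup you would probably also cap the number of retries so a permanently broken job doesn't loop forever.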
Let me know if you have any questions.
Riz