Hello,
Our team recently experienced an issue where a teammate started a new workflow job and then went on vacation. The job ended up running continuously for 4.5 days without failing. The cluster's usage didn't look out of place during the workday, since we all put load on it; it's an all-purpose cluster that also runs jobs at night, so high utilization after hours is normal, and we aren't watching the cluster outside working hours. Only during one of our meetings did someone notice it had been running at max capacity for an unusually long time. We tracked down the culprit, and we're aware that we can set a maximum timeout on jobs, but we still feel there should be safeguards in place to prevent an occurrence like this in the future.
Is there a way to send a notification or alert when a cluster has been running for more than X hours without terminating?
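To frame what we're after, here is a minimal sketch of the kind of safeguard we have in mind if no built-in alert exists: a small script (run on a schedule) that polls the Clusters API for running clusters and posts a message once uptime passes a threshold. The environment variables, the webhook target, and the 12-hour threshold are placeholders, and we'd prefer a native feature over maintaining something like this ourselves.

# Sketch of a scheduled uptime watchdog (not a built-in Databricks feature).
# DATABRICKS_HOST, DATABRICKS_TOKEN, and ALERT_WEBHOOK_URL are placeholders
# for our own workspace, PAT, and alert channel (Slack/Teams/email gateway).
import os
import time
import requests

HOST = os.environ["DATABRICKS_HOST"]        # e.g. https://<workspace>.cloud.databricks.com
TOKEN = os.environ["DATABRICKS_TOKEN"]
WEBHOOK = os.environ["ALERT_WEBHOOK_URL"]   # hypothetical incoming-webhook URL
MAX_UPTIME_HOURS = 12                       # our "X" threshold, tune as needed

def list_clusters():
    # Clusters API 2.0: returns {"clusters": [...]} with state and timestamps
    resp = requests.get(
        f"{HOST}/api/2.0/clusters/list",
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("clusters", [])

def check_uptimes():
    now_ms = time.time() * 1000
    for cluster in list_clusters():
        if cluster.get("state") != "RUNNING":
            continue
        # start_time is epoch milliseconds; a restarted cluster may warrant
        # using a more recent timestamp instead, this is just the rough idea
        uptime_hours = (now_ms - cluster["start_time"]) / (1000 * 60 * 60)
        if uptime_hours > MAX_UPTIME_HOURS:
            requests.post(
                WEBHOOK,
                json={"text": (
                    f"Cluster {cluster['cluster_name']} has been running "
                    f"for {uptime_hours:.1f}h (threshold {MAX_UPTIME_HOURS}h)."
                )},
                timeout=30,
            )

if __name__ == "__main__":
    check_uptimes()

Essentially we want this behavior as an alert, rather than one more job we have to remember to keep running.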
Thank you!