How to send alert when cluster is running for too long
08-22-2023 04:31 PM
Hello,
Our team recently ran into an issue where a teammate started a new workflow job and then went on vacation; the job ran continuously, without failing, for 4.5 days. Its usage didn't stand out during the workday because we all put load on the cluster, and since it's an all-purpose cluster that also runs jobs at night, high usage after hours is normal too, so nobody was watching it outside of working hours. Only during one of our meetings did someone notice it had been running at max capacity for an unusually long time. We found the culprit, and we're aware that we can set maximum timeouts on jobs, but we still feel there should be safeguards in place to prevent an occurrence like this in the future.
Is there a way to send a notification or alert in the event the cluster has been running for X period of time without termination?
Thank you!
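For anyone landing on this thread: the per-job maximum timeout mentioned above is the `timeout_seconds` job setting. A minimal sketch of setting it through the Databricks Python SDK (`databricks-sdk`) is below; the job ID and 6-hour cap are placeholders, not values from this thread:

```python
def hours_to_seconds(hours: float) -> int:
    """Convert an hour-based cap to the seconds value the Jobs API expects."""
    return int(hours * 3600)


def cap_job_timeout(job_id: int, max_hours: float = 6) -> None:
    """Set timeout_seconds on an existing job so runs fail past the cap.

    Sketch only: assumes the databricks-sdk package is installed and the
    workspace credentials come from environment variables or ~/.databrickscfg.
    """
    from databricks.sdk import WorkspaceClient
    from databricks.sdk.service.jobs import JobSettings

    w = WorkspaceClient()
    w.jobs.update(
        job_id=job_id,  # placeholder job ID
        new_settings=JobSettings(timeout_seconds=hours_to_seconds(max_hours)),
    )
```

A run that exceeds the cap is terminated and marked as failed, so the job's ordinary on-failure notifications fire.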
08-23-2023 07:05 AM
Hi @Retired_mod,
Thank you for your reply. I marked your response as the solution; however, due to the nature of its business, my company must use a private Databricks deployment that lacks many of the features available in the latest release of "normal" Databricks. This appears to be one of them: when editing a job's notifications, I don't see any of the duration-warning options listed in your instructions.
08-24-2023 07:20 AM
I ended up creating a job that uses the Databricks Python SDK to check cluster uptime and active job run times. The script raises an error and notifies the team if the cluster hasn't terminated or restarted in the past 24 hours, or if any job has been running for more than 6 hours. Thank you again for your help!
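A minimal sketch of that kind of watchdog, assuming the `databricks-sdk` package and that the script itself runs as a scheduled job whose on-failure notifications alert the team. The thresholds and field usage are illustrative, not the exact script from this thread:

```python
import time
from typing import Optional

# Thresholds in hours (assumptions; tune to your environment).
MAX_CLUSTER_UPTIME_HOURS = 24
MAX_JOB_RUN_HOURS = 6


def hours_running(start_ms: int, now_ms: Optional[int] = None) -> float:
    """Convert a millisecond epoch start time into hours elapsed."""
    now_ms = int(time.time() * 1000) if now_ms is None else now_ms
    return (now_ms - start_ms) / 3_600_000


def main() -> None:
    # Imported here so the pure helper above stays importable without the SDK.
    from databricks.sdk import WorkspaceClient  # pip install databricks-sdk
    from databricks.sdk.service.compute import State

    w = WorkspaceClient()  # reads host/token from env or ~/.databrickscfg
    offenders = []

    for cluster in w.clusters.list():
        # last_restarted_time is the most recent start/restart, in epoch ms.
        started = cluster.last_restarted_time or cluster.start_time
        if cluster.state == State.RUNNING and started:
            if hours_running(started) > MAX_CLUSTER_UPTIME_HOURS:
                offenders.append(f"cluster {cluster.cluster_name} up "
                                 f"{hours_running(started):.1f}h")

    for run in w.jobs.list_runs(active_only=True):
        if run.start_time and hours_running(run.start_time) > MAX_JOB_RUN_HOURS:
            offenders.append(f"run {run.run_id} running "
                             f"{hours_running(run.start_time):.1f}h")

    if offenders:
        # Raising makes the wrapping Databricks job fail, which triggers
        # its ordinary on-failure email/webhook notifications.
        raise RuntimeError("Long-running resources:\n" + "\n".join(offenders))


if __name__ == "__main__":
    main()
```

Scheduling this hourly keeps the detection gap small without adding load.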
Kurt