Hello,
Our team recently experienced an issue where a teammate started a new workflow job and then went on vacation. The job ended up running continuously for 4.5 days without failing. The cluster's usage didn't look out of place during the workday, since we all put load on it; it's an all-purpose cluster that also runs jobs at night, so high utilization after hours is normal, and we aren't watching the cluster outside working hours. Only during one of our meetings did someone notice it had been running at max capacity for an unusually long time. We tracked down the culprit, and we're aware that we can set a maximum timeout on jobs, but we still feel there should be safeguards in place to prevent an occurrence like this in the future.
Is there a way to send a notification or alert when a cluster has been running for more than X hours without terminating?
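To frame what we're after, here is a minimal sketch of the kind of safeguard we have in mind if no built-in alert exists: a small script (run on a schedule) that polls the Clusters API for running clusters and posts a message once uptime passes a threshold. The environment variables, the webhook target, and the 12-hour threshold are placeholders, and we'd prefer a native feature over maintaining something like this ourselves.

# Sketch of a scheduled uptime watchdog (not a built-in Databricks feature).
# DATABRICKS_HOST, DATABRICKS_TOKEN, and ALERT_WEBHOOK_URL are placeholders
# for our own workspace, PAT, and alert channel (Slack/Teams/email gateway).
import os
import time
import requests

HOST = os.environ["DATABRICKS_HOST"]        # e.g. https://<workspace>.cloud.databricks.com
TOKEN = os.environ["DATABRICKS_TOKEN"]
WEBHOOK = os.environ["ALERT_WEBHOOK_URL"]   # hypothetical incoming-webhook URL
MAX_UPTIME_HOURS = 12                       # our "X" threshold, tune as needed

def list_clusters():
    # Clusters API 2.0: returns {"clusters": [...]} with state and timestamps
    resp = requests.get(
        f"{HOST}/api/2.0/clusters/list",
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("clusters", [])

def check_uptimes():
    now_ms = time.time() * 1000
    for cluster in list_clusters():
        if cluster.get("state") != "RUNNING":
            continue
        # start_time is epoch milliseconds; a restarted cluster may warrant
        # using a more recent timestamp instead, this is just the rough idea
        uptime_hours = (now_ms - cluster["start_time"]) / (1000 * 60 * 60)
        if uptime_hours > MAX_UPTIME_HOURS:
            requests.post(
                WEBHOOK,
                json={"text": (
                    f"Cluster {cluster['cluster_name']} has been running "
                    f"for {uptime_hours:.1f}h (threshold {MAX_UPTIME_HOURS}h)."
                )},
                timeout=30,
            )

if __name__ == "__main__":
    check_uptimes()

Essentially we want this behavior as an alert, rather than one more job we have to remember to keep running.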
Thank you!