Community Platform Discussions
Connect with fellow community members to discuss general topics related to the Databricks platform, industry trends, and best practices. Share experiences, ask questions, and foster collaboration within the community.

How to send alert when cluster is running for too long

kurtrm
New Contributor III

Hello,

Our team recently had an incident in which a teammate started a new workflow job and then went on vacation. The job ran continuously, without failing, for 4.5 days. The cluster's usage did not look unusual: it is an all-purpose cluster that we all put load on during the workday and that also runs jobs at night, so high usage is normal at both times, and nobody watches the cluster outside working hours. Only during one of our meetings did someone notice that it had been running at maximum capacity for an unusually long time. We identified the culprit, and we know that jobs can be given a maximum timeout, but we still feel there should be safeguards to prevent this from happening again.

 

Is there a way to send a notification or alert when a cluster has been running for X amount of time without terminating?

Thank you!

2 REPLIES

kurtrm
New Contributor III

Hi @Retired_mod,

Thank you for your reply. I marked your response as the solution; however, due to the nature of its business, my company must use a private Databricks deployment, which is missing many of the features available in the latest release of "normal" Databricks. This appears to be one of them: when I edit the notifications for a job, I don't see any of the options listed in the instructions for adding duration warnings.

kurtrm
New Contributor III

@Retired_mod,

I ended up creating a job leveraging the Databricks Python SDK to check cluster and active job run times. The script will raise an error and notify the team if the cluster hasn't terminated or restarted in the past 24 hours or if a job has been running in excess of 6 hours. Thank you again for your help!
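The script itself was not shared in this thread. What follows is only a rough sketch of how such a check might look with the Databricks SDK for Python, not the actual implementation. The function names, constants, and message wording are hypothetical. The 24-hour and 6-hour limits come from the description above. The sketch assumes that `start_time` and `last_restarted_time` on the cluster details, and `start_time` on each run, are epoch milliseconds; that is worth verifying against the SDK version available in a private deployment.

```python
from __future__ import annotations

import time
from typing import Mapping, Optional

MS_PER_HOUR = 3_600_000
CLUSTER_MAX_HOURS = 24  # limit taken from the post
JOB_MAX_HOURS = 6       # limit taken from the post


def hours_since(start_ms: Optional[int], now_ms: int) -> float:
    """Hours elapsed since an epoch-millisecond timestamp (0.0 if missing)."""
    if not start_ms:
        return 0.0
    return (now_ms - start_ms) / MS_PER_HOUR


def find_violations(
    cluster_up_since_ms: Optional[int],
    active_run_starts: Mapping[int, Optional[int]],
    now_ms: int,
) -> list[str]:
    """Return one message per cluster or job run that exceeds its limit."""
    problems: list[str] = []
    cluster_hours = hours_since(cluster_up_since_ms, now_ms)
    if cluster_hours > CLUSTER_MAX_HOURS:
        problems.append(
            f"cluster up for {cluster_hours:.1f} h (limit {CLUSTER_MAX_HOURS} h)"
        )
    for run_id, start_ms in active_run_starts.items():
        run_hours = hours_since(start_ms, now_ms)
        if run_hours > JOB_MAX_HOURS:
            problems.append(
                f"run {run_id} active for {run_hours:.1f} h (limit {JOB_MAX_HOURS} h)"
            )
    return problems


def check_workspace(cluster_id: str) -> None:
    """Hypothetical entry point: query the workspace, raise on any violation."""
    # Imported here so the pure helpers above work without the SDK installed.
    from databricks.sdk import WorkspaceClient
    from databricks.sdk.service.compute import State

    w = WorkspaceClient()  # picks up credentials from the environment
    cluster = w.clusters.get(cluster_id=cluster_id)

    up_since = None
    if cluster.state == State.RUNNING:
        # Assumption: the later of the two timestamps marks the last (re)start.
        up_since = max(cluster.start_time or 0, cluster.last_restarted_time or 0)

    # Map each active run to its start time.
    runs = {r.run_id: r.start_time for r in w.jobs.list_runs(active_only=True)}

    problems = find_violations(up_since, runs, int(time.time() * 1000))
    if problems:
        # A failed run can then trigger the job's on-failure notification.
        raise RuntimeError("; ".join(problems))
```

Scheduled as its own job that calls `check_workspace("<cluster-id>")` with failure notifications enabled, a raised error would reach the team through the normal job-failure email. One caveat with this design is that the watchdog job needs compute to run on, so scheduling it on the same all-purpose cluster would keep that cluster alive.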

Kurt
