Hi Databricks Community,
I have a real-life use case that I would like to implement as soon as possible, so I am reaching out to you for implementation guidelines, ideas, documentation, and best practices.
Assume I am an IT manager, and my production Databricks environment has many all-purpose clusters as well as job clusters. As the manager of this environment, I would like to monitor usage, status, and errors from a dashboard, with email notifications, in the simplest way possible. The dashboard should show all the key information needed for utilization status, quick fault finding, cost reduction, etc.
I hope I am not asking for anything impractical. Please share your inputs if possible.
Major points that I would like to cover:
1> Are my Databricks clusters under-utilized or over-utilized?
2> If my Databricks clusters are over-utilized, which process, set of queries, user, or time frame accounts for the high resource consumption?
2.a> Is any particular set of queries causing issues?
3> Assume one of my clusters has "1" as min nodes and "20" as max nodes. For how much of the time does node utilization stay above 70% (or any other %), and what do the utilization trend lines look like?
4> Notifications for events such as a cluster being restarted or terminated, or one particular job failing x consecutive times, etc.
5> Anything else that should be monitored or acted on immediately.
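
To make points 3 and 4 more concrete, here is a rough Python sketch of the two checks I have in mind. It does not call any Databricks API: I am assuming the utilization samples and job run results have already been exported from somewhere, and the function names, the input shapes, and the "FAILED" status string are all my own placeholders.

```python
from datetime import datetime, timedelta


def fraction_of_time_above(samples, threshold):
    """Share of the observed period spent above `threshold`.

    `samples` is a list of (timestamp, utilization_percent) tuples sorted
    by time. Assumption: each sample's value holds until the next sample.
    """
    if len(samples) < 2:
        return 0.0
    total = (samples[-1][0] - samples[0][0]).total_seconds()
    if total <= 0:
        return 0.0
    above = 0.0
    # Pair each sample with its successor and add up the high intervals.
    for (t0, value), (t1, _) in zip(samples, samples[1:]):
        if value > threshold:
            above += (t1 - t0).total_seconds()
    return above / total


def has_consecutive_failures(run_states, x):
    """True if the most recent `x` runs all failed.

    `run_states` is a list of result strings, oldest first. The "FAILED"
    label is a placeholder for whatever the real source reports.
    """
    if x <= 0 or len(run_states) < x:
        return False
    return all(state == "FAILED" for state in run_states[-x:])


if __name__ == "__main__":
    start = datetime(2024, 1, 1)
    step = timedelta(minutes=10)
    # Made-up utilization samples, one every 10 minutes.
    samples = [(start + i * step, pct) for i, pct in enumerate([50, 80, 90, 60, 40])]
    print(fraction_of_time_above(samples, 70))
    print(has_consecutive_failures(["SUCCESS", "FAILED", "FAILED", "FAILED"], 3))
```

What I am really asking is where this kind of data should come from in Databricks, and whether there is a built-in way to get the same result without writing it myself.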