I'd recommend this new tool we've been trying out. It's really helpful for monitoring and gives good insight into how Azure Databricks clusters, pools, and jobs are doing, for example whether they're healthy or having issues. It brings everything together, making it easier to figure out what's going wrong and fix it faster.
Personally, I also feel they seem focused on Databricks, but ultimately whichever tool gives your team the best visibility and automation should take priority. Check out this video, which might also help: https://www.youtube.com/watch?v=vD0S3h7ZJNU&t=355s