Hi community,
My team and I have been occasionally experiencing INTERNAL_ERROR events in Databricks. We have a job that runs on a schedule, but the start times vary. Sometimes, when the job is triggered, the underlying cluster fails to start for some reason.
I'd like some advice on how to better investigate these issues and how to set up a mitigation or fallback mechanism. Specifically, I want a way to detect when the job starts but the cluster cannot initialize, and then run an alternative process or send an alert.
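To make the question concrete, here is a rough Python sketch of the kind of check I have in mind. It assumes the run details have already been fetched as a dict shaped like the Jobs API `runs/get` response, where `state.life_cycle_state` can be `INTERNAL_ERROR` and `state.state_message` carries the reason. The function names, the fallback callback, and the sample payload are made up for illustration, and I am not sure this is the right place to hook in.

```python
from typing import Callable

# Assumption: a run that never got a working cluster is reported with
# life_cycle_state == "INTERNAL_ERROR" in the run's `state` object.
STARTUP_FAILURE_STATES = {"INTERNAL_ERROR"}


def failed_before_running(run: dict) -> bool:
    """Return True if the run's life-cycle state indicates an internal error."""
    state = run.get("state", {})
    return state.get("life_cycle_state") in STARTUP_FAILURE_STATES


def handle_run(run: dict, fallback: Callable[[str], None]) -> bool:
    """Invoke the (hypothetical) fallback with the state message on failure.

    Returns True if the fallback was triggered, False otherwise.
    """
    if failed_before_running(run):
        message = run.get("state", {}).get("state_message", "")
        fallback(message)
        return True
    return False


if __name__ == "__main__":
    # Illustrative payload only; a real one would come from the Jobs API.
    sample_run = {
        "run_id": 1,
        "state": {
            "life_cycle_state": "INTERNAL_ERROR",
            "state_message": "example message",
        },
    }
    handle_run(sample_run, lambda msg: print(f"ALERT: {msg}"))
```

The fallback is passed in as a callback so the same check could either send an alert or kick off the alternative process.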
Any suggestions or best practices would be greatly appreciated!