Hi everyone,
I recently dealt with a frustrating scenario: a Databricks job that usually takes minutes ran for 18 hours without failing, quietly consuming compute and blocking downstream pipelines.
The driver hadn't crashed, and the job hadn't failed; it was just "stuck." This led our team to realize that while "adding a timeout" sounds simple, picking a static number is often counterproductive (especially for jobs that run frequently or have seasonal variance).
I wrote a technical deep dive on how we moved away from "guesswork" timeouts to a per-job, data-driven, and schedule-aware approach using the Interquartile Range (IQR) method to filter noise.
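To give a flavor of the idea, here is a minimal sketch of an IQR-based timeout suggestion (this is illustrative only, not the implementation from the post; the function name and the 1.5x safety margin are my own assumptions):

```python
import statistics

def suggest_timeout_seconds(durations, margin=1.5):
    """Suggest a per-job timeout from historical run durations (in seconds).

    Runs beyond the Tukey upper fence (Q3 + 1.5 * IQR) are treated as
    anomalies and excluded, so one past 18-hour hang does not inflate
    the timeout forever. 'margin' is an illustrative safety factor.
    """
    q1, _, q3 = statistics.quantiles(durations, n=4)
    iqr = q3 - q1
    upper_fence = q3 + 1.5 * iqr
    normal_runs = [d for d in durations if d <= upper_fence]
    # Timeout = slowest "normal" run plus the safety margin.
    return max(normal_runs) * margin

# Ten typical ~5-minute runs plus one 18-hour hang (64800 s):
history = [290, 300, 310, 295, 305, 315, 288, 302, 298, 64800]
print(suggest_timeout_seconds(history))  # ~472.5 s, unaffected by the hang
```

A plain mean or max over the same history would suggest a timeout of hours, which is exactly the guesswork problem the filtering avoids.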
I've shared the full logic and the story behind the 18-hour incident here: How an 18-Hour Databricks Job Run Led Us to Build Smarter Timeouts
I'd love to hear how others in the community handle job monitoring and whether you've built similar automated safeguards!