How we solved the "18-Hour Running Job" problem with Data-Driven Timeouts

Avinash_Narala
Databricks Partner

Hi everyone,

I recently dealt with a frustrating scenario: a Databricks job that usually takes minutes ran for 18 hours without failing, quietly consuming compute and blocking downstream pipelines.

The driver hadn't crashed, and the job hadn't failed; it was just "stuck." This led our team to realize that while "adding a timeout" sounds simple, picking a static number is often counterproductive (especially for jobs that run frequently or have seasonal variance).

I wrote a technical deep dive on how we moved away from "guesswork" timeouts to a per-job, data-driven, and schedule-aware approach that uses the interquartile range (IQR) to filter outlier runs out of each job's history before deriving its timeout.
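To make the core calculation concrete, here's a minimal sketch. It assumes you've already pulled a list of historical run durations for a given job; the function name and the 1.5x fence / 2x safety factors are illustrative defaults, not our exact production values:

```python
from statistics import quantiles

def iqr_filtered_timeout(durations_sec, k=1.5, safety_factor=2.0):
    """Derive a timeout (seconds) from a job's historical run durations.

    Runs outside the IQR fences are treated as noise (retries, backfills,
    cold-start spikes) and excluded before the timeout is computed.
    """
    if len(durations_sec) < 4:
        return None  # too little history to trust the statistics

    q1, _, q3 = quantiles(durations_sec, n=4)  # quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr

    normal_runs = [d for d in durations_sec if lower <= d <= upper]
    if not normal_runs:
        return None

    # Timeout = slowest "normal" run, padded by a safety factor.
    return int(max(normal_runs) * safety_factor)
```

The schedule-aware part then comes from recomputing this periodically per job, so the timeout tracks drift and seasonal shifts instead of being set once and forgotten.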

I've shared the full logic and the story behind the 18-hour incident here: How an 18-Hour Databricks Job Run Led Us to Build Smarter Timeouts
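For anyone who wants to experiment before reading the full writeup, the wiring against the Jobs API looks roughly like this. Treat it as a hedged sketch, not our production code: it assumes a PAT in DATABRICKS_TOKEN, uses the public runs/list and jobs/update endpoints, and skips pagination and error handling:

```python
import os
import requests

HOST = os.environ["DATABRICKS_HOST"]  # e.g. https://<workspace>.cloud.databricks.com
HEADERS = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

def recent_durations_sec(job_id, limit=25):
    """Wall-clock durations of recent successful runs, in seconds."""
    resp = requests.get(
        f"{HOST}/api/2.1/jobs/runs/list",
        headers=HEADERS,
        params={"job_id": job_id, "completed_only": "true", "limit": limit},
    )
    resp.raise_for_status()
    return [
        (run["end_time"] - run["start_time"]) / 1000.0  # epoch ms -> s
        for run in resp.json().get("runs", [])
        if run.get("state", {}).get("result_state") == "SUCCESS"
    ]

def apply_timeout(job_id, timeout_sec):
    """Patch only timeout_seconds, leaving the rest of the job settings alone."""
    resp = requests.post(
        f"{HOST}/api/2.1/jobs/update",
        headers=HEADERS,
        json={"job_id": job_id, "new_settings": {"timeout_seconds": timeout_sec}},
    )
    resp.raise_for_status()
```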

I'd love to hear how others in the community handle job monitoring and whether you've built similar automated safeguards!
