<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>How we solved the &quot;18-Hour Running Job&quot; problem with Data-Driven Timeouts in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-we-solved-the-quot-18-hour-running-job-quot-problem-with/m-p/154902#M54153</link>
    <description>&lt;P&gt;Hi everyone,&lt;/P&gt;&lt;P&gt;I recently dealt with a frustrating scenario: a Databricks job that usually takes minutes ran for &lt;STRONG&gt;18 hours&lt;/STRONG&gt; without failing, quietly consuming compute and blocking downstream pipelines.&lt;/P&gt;&lt;P&gt;The driver hadn't crashed, and the job hadn't failed—it was just "stuck." This led our team to realize that while "adding a timeout" sounds simple, picking a static number is often counterproductive (especially for jobs that run frequently or have seasonal variance).&lt;/P&gt;&lt;P&gt;I wrote a technical deep dive on how we moved away from "guesswork" timeouts to a &lt;STRONG&gt;per-job, data-driven, and schedule-aware approach&lt;/STRONG&gt; using the Interquartile Range (IQR) method to filter noise.&lt;/P&gt;&lt;P&gt;I’ve shared the full logic and the story behind the 18-hour incident here: &lt;A href="https://medium.com/@avinash.narala6814/how-an-18-hour-databricks-job-run-led-us-to-build-smarter-timeouts-f3c0a5f85a48" target="_self"&gt;How an 18-Hour Databricks Job Run Led Us to Build Smarter Timeouts&lt;/A&gt;&lt;/P&gt;&lt;P&gt;I'd love to hear how others in the community handle job monitoring and if you’ve built similar automated safeguards!&lt;/P&gt;</description>
    <pubDate>Mon, 20 Apr 2026 01:12:23 GMT</pubDate>
    <dc:creator>Avinash_Narala</dc:creator>
    <dc:date>2026-04-20T01:12:23Z</dc:date>
    <item>
      <title>How we solved the "18-Hour Running Job" problem with Data-Driven Timeouts</title>
      <link>https://community.databricks.com/t5/data-engineering/how-we-solved-the-quot-18-hour-running-job-quot-problem-with/m-p/154902#M54153</link>
      <description>&lt;P&gt;Hi everyone,&lt;/P&gt;&lt;P&gt;I recently dealt with a frustrating scenario: a Databricks job that usually takes minutes ran for &lt;STRONG&gt;18 hours&lt;/STRONG&gt; without failing, quietly consuming compute and blocking downstream pipelines.&lt;/P&gt;&lt;P&gt;The driver hadn't crashed, and the job hadn't failed—it was just "stuck." This led our team to realize that while "adding a timeout" sounds simple, picking a static number is often counterproductive (especially for jobs that run frequently or have seasonal variance).&lt;/P&gt;&lt;P&gt;I wrote a technical deep dive on how we moved away from "guesswork" timeouts to a &lt;STRONG&gt;per-job, data-driven, and schedule-aware approach&lt;/STRONG&gt; using the Interquartile Range (IQR) method to filter noise.&lt;/P&gt;&lt;P&gt;I’ve shared the full logic and the story behind the 18-hour incident here: &lt;A href="https://medium.com/@avinash.narala6814/how-an-18-hour-databricks-job-run-led-us-to-build-smarter-timeouts-f3c0a5f85a48" target="_self"&gt;How an 18-Hour Databricks Job Run Led Us to Build Smarter Timeouts&lt;/A&gt;&lt;/P&gt;&lt;P&gt;I'd love to hear how others in the community handle job monitoring and if you’ve built similar automated safeguards!&lt;/P&gt;</description>
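      The per-job IQR approach described above can be sketched roughly as follows. This is a hypothetical illustration, not the author's implementation: it assumes historical run durations (in minutes) for a single job have already been collected, and uses the standard Q3 + 1.5 * IQR upper-fence rule to derive a timeout that ignores one-off outlier runs.

      ```python
      import statistics

      def iqr_timeout(durations, k=1.5):
          """Hypothetical per-job timeout: Q3 + k * IQR over historical run durations.

          Runs longer than this threshold are outliers relative to the job's own
          history, so they are candidates for cancellation/alerting.
          """
          # statistics.quantiles with n=4 returns the three quartile cut points.
          q1, q2, q3 = statistics.quantiles(durations, n=4)
          iqr = q3 - q1
          return q3 + k * iqr

      # Example: a job that normally finishes in about 10 minutes.
      history = [9, 10, 11, 10, 12, 9, 11, 10]
      limit = iqr_timeout(history)
      # A run exceeding `limit` minutes would be flagged long before 18 hours.
      ```

      The schedule-aware part would then feed this threshold back into each job's own timeout setting (e.g. the job-level timeout in the Databricks Jobs configuration), recomputed periodically so it tracks seasonal drift.
      
      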
      <pubDate>Mon, 20 Apr 2026 01:12:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-we-solved-the-quot-18-hour-running-job-quot-problem-with/m-p/154902#M54153</guid>
      <dc:creator>Avinash_Narala</dc:creator>
      <dc:date>2026-04-20T01:12:23Z</dc:date>
    </item>
  </channel>
</rss>

