Hi everyone,
I recently dealt with a frustrating scenario: a Databricks job that usually takes minutes ran for 18 hours without failing, quietly consuming compute and blocking downstream pipelines.
The driver hadn't crashed, and the job hadn't failed; it was just "stuck." This led our team to realize that while "adding a timeout" sounds simple, picking a static number is often counterproductive (especially for jobs that run frequently or have seasonal variance).
I wrote a technical deep dive on how we moved away from "guesswork" timeouts to a per-job, data-driven, and schedule-aware approach using the Interquartile Range (IQR) method to filter noise.
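To give a flavor of the idea, here is a minimal sketch of an IQR-based timeout suggestion (this is illustrative only, not the implementation from the post; the function name and the 1.5x safety margin are my own assumptions):

```python
import statistics

def suggest_timeout_seconds(durations, margin=1.5):
    """Suggest a per-job timeout from historical run durations (in seconds).

    Runs beyond the Tukey upper fence (Q3 + 1.5 * IQR) are treated as
    anomalies and excluded, so one past 18-hour hang does not inflate
    the timeout forever. 'margin' is an illustrative safety factor.
    """
    q1, _, q3 = statistics.quantiles(durations, n=4)
    iqr = q3 - q1
    upper_fence = q3 + 1.5 * iqr
    normal_runs = [d for d in durations if d <= upper_fence]
    # Timeout = slowest "normal" run plus the safety margin.
    return max(normal_runs) * margin

# Ten typical ~5-minute runs plus one 18-hour hang (64800 s):
history = [290, 300, 310, 295, 305, 315, 288, 302, 298, 64800]
print(suggest_timeout_seconds(history))  # ~472.5 s, unaffected by the hang
```

A plain mean or max over the same history would suggest a timeout of hours, which is exactly the guesswork problem the filtering avoids.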
I've shared the full logic and the story behind the 18-hour incident here: How an 18-Hour Databricks Job Run Led Us to Build Smarter Timeouts
I'd love to hear how others in the community handle job monitoring and whether you've built similar automated safeguards!