How we solved the "18-Hour Running Job" problem with Data-Driven Timeouts

Avinash_Narala
Databricks Partner

Hi everyone,

I recently dealt with a frustrating scenario: a Databricks job that usually takes minutes ran for 18 hours without failing, quietly consuming compute and blocking downstream pipelines.

The driver hadn't crashed, and the job hadn't failed; it was just "stuck." This led our team to realize that while "adding a timeout" sounds simple, picking a static number is often counterproductive (especially for jobs that run frequently or have seasonal variance).

I wrote a technical deep dive on how we moved away from "guesswork" timeouts to a per-job, data-driven, and schedule-aware approach that uses the interquartile range (IQR) to filter outlier runs out of each job's history before deriving its timeout.
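To make the core calculation concrete, here's a minimal sketch. It assumes you've already pulled a list of historical run durations for a given job; the function name and the 1.5x fence / 2x safety factors are illustrative defaults, not our exact production values:

```python
from statistics import quantiles

def iqr_filtered_timeout(durations_sec, k=1.5, safety_factor=2.0):
    """Derive a timeout (seconds) from a job's historical run durations.

    Runs outside the IQR fences are treated as noise (retries, backfills,
    cold-start spikes) and excluded before the timeout is computed.
    """
    if len(durations_sec) < 4:
        return None  # too little history to trust the statistics

    q1, _, q3 = quantiles(durations_sec, n=4)  # quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr

    normal_runs = [d for d in durations_sec if lower <= d <= upper]
    if not normal_runs:
        return None

    # Timeout = slowest "normal" run, padded by a safety factor.
    return int(max(normal_runs) * safety_factor)
```

The schedule-aware part then comes from recomputing this periodically per job, so the timeout tracks drift and seasonal shifts instead of being set once and forgotten.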

I've shared the full logic and the story behind the 18-hour incident here: How an 18-Hour Databricks Job Run Led Us to Build Smarter Timeouts
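For anyone who wants to experiment before reading the full writeup, the wiring against the Jobs API looks roughly like this. Treat it as a hedged sketch, not our production code: it assumes a PAT in DATABRICKS_TOKEN, uses the public runs/list and jobs/update endpoints, and skips pagination and error handling:

```python
import os
import requests

HOST = os.environ["DATABRICKS_HOST"]  # e.g. https://<workspace>.cloud.databricks.com
HEADERS = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

def recent_durations_sec(job_id, limit=25):
    """Wall-clock durations of recent successful runs, in seconds."""
    resp = requests.get(
        f"{HOST}/api/2.1/jobs/runs/list",
        headers=HEADERS,
        params={"job_id": job_id, "completed_only": "true", "limit": limit},
    )
    resp.raise_for_status()
    return [
        (run["end_time"] - run["start_time"]) / 1000.0  # epoch ms -> s
        for run in resp.json().get("runs", [])
        if run.get("state", {}).get("result_state") == "SUCCESS"
    ]

def apply_timeout(job_id, timeout_sec):
    """Patch only timeout_seconds, leaving the rest of the job settings alone."""
    resp = requests.post(
        f"{HOST}/api/2.1/jobs/update",
        headers=HEADERS,
        json={"job_id": job_id, "new_settings": {"timeout_seconds": timeout_sec}},
    )
    resp.raise_for_status()
```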

I'd love to hear how others in the community handle job monitoring and whether you've built similar automated safeguards!
