<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>How we solved the &quot;18-Hour Running Job&quot; problem with Data-Driven Timeouts in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-we-solved-the-quot-18-hour-running-job-quot-problem-with/m-p/154902#M54153</link>
    <description>&lt;P&gt;Hi everyone,&lt;/P&gt;&lt;P&gt;I recently dealt with a frustrating scenario: a Databricks job that usually takes minutes ran for &lt;STRONG&gt;18 hours&lt;/STRONG&gt; without failing, quietly consuming compute and blocking downstream pipelines.&lt;/P&gt;&lt;P&gt;The driver hadn't crashed, and the job hadn't failed—it was just "stuck." This led our team to realize that while "adding a timeout" sounds simple, picking a static number is often counterproductive (especially for jobs that run frequently or have seasonal variance).&lt;/P&gt;&lt;P&gt;I wrote a technical deep dive on how we moved away from "guesswork" timeouts to a &lt;STRONG&gt;per-job, data-driven, and schedule-aware approach&lt;/STRONG&gt; using the Interquartile Range (IQR) method to filter noise.&lt;/P&gt;&lt;P&gt;I’ve shared the full logic and the story behind the 18-hour incident here: &lt;A href="https://medium.com/@avinash.narala6814/how-an-18-hour-databricks-job-run-led-us-to-build-smarter-timeouts-f3c0a5f85a48" target="_self"&gt;How an 18-Hour Databricks Job Run Led Us to Build Smarter Timeouts&lt;/A&gt;&lt;/P&gt;&lt;P&gt;I'd love to hear how others in the community handle job monitoring and if you’ve built similar automated safeguards!&lt;/P&gt;</description>
    <pubDate>Mon, 20 Apr 2026 01:12:23 GMT</pubDate>
    <dc:creator>Avinash_Narala</dc:creator>
    <dc:date>2026-04-20T01:12:23Z</dc:date>
    <item>
      <title>How we solved the "18-Hour Running Job" problem with Data-Driven Timeouts</title>
      <link>https://community.databricks.com/t5/data-engineering/how-we-solved-the-quot-18-hour-running-job-quot-problem-with/m-p/154902#M54153</link>
      <description>&lt;P&gt;Hi everyone,&lt;/P&gt;&lt;P&gt;I recently dealt with a frustrating scenario: a Databricks job that usually takes minutes ran for &lt;STRONG&gt;18 hours&lt;/STRONG&gt; without failing, quietly consuming compute and blocking downstream pipelines.&lt;/P&gt;&lt;P&gt;The driver hadn't crashed, and the job hadn't failed—it was just "stuck." This led our team to realize that while "adding a timeout" sounds simple, picking a static number is often counterproductive (especially for jobs that run frequently or have seasonal variance).&lt;/P&gt;&lt;P&gt;I wrote a technical deep dive on how we moved away from "guesswork" timeouts to a &lt;STRONG&gt;per-job, data-driven, and schedule-aware approach&lt;/STRONG&gt; using the Interquartile Range (IQR) method to filter noise.&lt;/P&gt;&lt;P&gt;I’ve shared the full logic and the story behind the 18-hour incident here: &lt;A href="https://medium.com/@avinash.narala6814/how-an-18-hour-databricks-job-run-led-us-to-build-smarter-timeouts-f3c0a5f85a48" target="_self"&gt;How an 18-Hour Databricks Job Run Led Us to Build Smarter Timeouts&lt;/A&gt;&lt;/P&gt;&lt;P&gt;I'd love to hear how others in the community handle job monitoring and if you’ve built similar automated safeguards!&lt;/P&gt;</description>
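      The per-job IQR approach described above can be sketched roughly as follows. This is a hypothetical illustration, not the author's implementation: it assumes historical run durations (in minutes) for a single job have already been collected, and uses the standard Q3 + 1.5 * IQR upper-fence rule to derive a timeout that ignores one-off outlier runs.

      ```python
      import statistics

      def iqr_timeout(durations, k=1.5):
          """Hypothetical per-job timeout: Q3 + k * IQR over historical run durations.

          Runs longer than this threshold are outliers relative to the job's own
          history, so they are candidates for cancellation/alerting.
          """
          # statistics.quantiles with n=4 returns the three quartile cut points.
          q1, q2, q3 = statistics.quantiles(durations, n=4)
          iqr = q3 - q1
          return q3 + k * iqr

      # Example: a job that normally finishes in about 10 minutes.
      history = [9, 10, 11, 10, 12, 9, 11, 10]
      limit = iqr_timeout(history)
      # A run exceeding `limit` minutes would be flagged long before 18 hours.
      ```

      The schedule-aware part would then feed this threshold back into each job's own timeout setting (e.g. the job-level timeout in the Databricks Jobs configuration), recomputed periodically so it tracks seasonal drift.
      
      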
      <pubDate>Mon, 20 Apr 2026 01:12:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-we-solved-the-quot-18-hour-running-job-quot-problem-with/m-p/154902#M54153</guid>
      <dc:creator>Avinash_Narala</dc:creator>
      <dc:date>2026-04-20T01:12:23Z</dc:date>
    </item>
  </channel>
</rss>

