<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic DLT | Communication lost with driver | Cluster was not reachable for 120 seconds in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/dlt-communication-lost-with-driver-cluster-was-not-reachable-for/m-p/136885#M50661</link>
    <description>&lt;P&gt;Hey Community,&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;I'm facing this error. It says: "&lt;SPAN&gt;com.databricks.pipelines.common.errors.deployment.DeploymentException: Communication lost with driver. Cluster 1030-205818-yu28ft9s was not reachable for 120 seconds"&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="mkwparth_0-1761892686441.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/21207i987CF5A31F1DB325/image-size/medium?v=v2&amp;amp;px=400" role="button" title="mkwparth_0-1761892686441.png" alt="mkwparth_0-1761892686441.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;This issue occurred in production, but after re-running the job, it worked fine. I can't figure out why it happens intermittently; it’s quite a strange and inconsistent error. Has anyone else experienced something similar, or does anyone know what might be causing it?&lt;/SPAN&gt;&lt;/P&gt;</description>
    <pubDate>Fri, 31 Oct 2025 06:48:46 GMT</pubDate>
    <dc:creator>mkwparth</dc:creator>
    <dc:date>2025-10-31T06:48:46Z</dc:date>
    <item>
      <title>DLT | Communication lost with driver | Cluster was not reachable for 120 seconds</title>
      <link>https://community.databricks.com/t5/data-engineering/dlt-communication-lost-with-driver-cluster-was-not-reachable-for/m-p/136885#M50661</link>
      <description>&lt;P&gt;Hey Community,&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;I'm facing this error. It says: "&lt;SPAN&gt;com.databricks.pipelines.common.errors.deployment.DeploymentException: Communication lost with driver. Cluster 1030-205818-yu28ft9s was not reachable for 120 seconds"&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="mkwparth_0-1761892686441.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/21207i987CF5A31F1DB325/image-size/medium?v=v2&amp;amp;px=400" role="button" title="mkwparth_0-1761892686441.png" alt="mkwparth_0-1761892686441.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;This issue occurred in production, but after re-running the job, it worked fine. I can't figure out why it happens intermittently; it’s quite a strange and inconsistent error. Has anyone else experienced something similar, or does anyone know what might be causing it?&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 31 Oct 2025 06:48:46 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dlt-communication-lost-with-driver-cluster-was-not-reachable-for/m-p/136885#M50661</guid>
      <dc:creator>mkwparth</dc:creator>
      <dc:date>2025-10-31T06:48:46Z</dc:date>
    </item>
    <item>
      <title>Re: DLT | Communication lost with driver | Cluster was not reachable for 120 seconds</title>
      <link>https://community.databricks.com/t5/data-engineering/dlt-communication-lost-with-driver-cluster-was-not-reachable-for/m-p/137419#M50741</link>
      <description>&lt;P&gt;Can you please try looking at detailed logs?&lt;/P&gt;
&lt;P&gt;&lt;A href="https://docs.microsoft.com/en-us/azure/databricks/clusters/configure#cluster-log-delivery" target="_blank" rel="nofollow noopener noreferrer"&gt;https://docs.microsoft.com/en-us/azure/databricks/clusters/configure#cluster-log-delivery&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 03 Nov 2025 15:49:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dlt-communication-lost-with-driver-cluster-was-not-reachable-for/m-p/137419#M50741</guid>
      <dc:creator>AbhaySingh</dc:creator>
      <dc:date>2025-11-03T15:49:35Z</dc:date>
    </item>
    <item>
      <title>Re: DLT | Communication lost with driver | Cluster was not reachable for 120 seconds</title>
      <link>https://community.databricks.com/t5/data-engineering/dlt-communication-lost-with-driver-cluster-was-not-reachable-for/m-p/137435#M50742</link>
      <description>&lt;P&gt;This is a known intermittent issue in Databricks, particularly with streaming or Delta Live Tables (DLT) pipelines.&lt;/P&gt;&lt;P&gt;This isn’t a logic error in your code; it’s an infrastructure-level timeout between the Databricks control plane and the driver node of your cluster. Essentially, Databricks lost communication with the driver for 2 minutes (120 seconds). After that period, it assumes the driver is dead and throws this exception. When you rerun, it works because the cluster re-initializes and network connections reset.&lt;/P&gt;&lt;P&gt;Here are a few troubleshooting steps:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;STRONG&gt;Check Driver Logs&lt;/STRONG&gt;&lt;UL&gt;&lt;LI&gt;Go to &lt;STRONG&gt;Compute → Cluster → Spark UI → Driver logs&lt;/STRONG&gt;&lt;/LI&gt;&lt;LI&gt;Search for:&lt;UL&gt;&lt;LI&gt;heartbeat timeout&lt;/LI&gt;&lt;LI&gt;GC overhead limit exceeded&lt;/LI&gt;&lt;LI&gt;OutOfMemoryError&lt;/LI&gt;&lt;LI&gt;communication lost&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Check Databricks Event Logs&lt;/STRONG&gt;&lt;UL&gt;&lt;LI&gt;Query the system.logs or eventLogs table in Unity Catalog (if logging is enabled).&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Monitor Cluster Metrics&lt;/STRONG&gt;&lt;UL&gt;&lt;LI&gt;Enable cluster metrics via the Databricks REST API or the Azure Monitor integration.&lt;/LI&gt;&lt;LI&gt;Look for CPU/memory spikes around the failure time.&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;Here are some possible fixes you can implement:&lt;/P&gt;&lt;TABLE&gt;&lt;THEAD&gt;&lt;TR&gt;&lt;TH&gt;Root Cause&lt;/TH&gt;&lt;TH&gt;Mitigation&lt;/TH&gt;&lt;/TR&gt;&lt;/THEAD&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD&gt;Driver overload&lt;/TD&gt;&lt;TD&gt;Use a larger driver; tune memory configs&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;Transient network loss&lt;/TD&gt;&lt;TD&gt;Enable retry logic in the job or pipeline&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;Auto-termination wake-up&lt;/TD&gt;&lt;TD&gt;Keep the cluster warm&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;Long DLT deployments&lt;/TD&gt;&lt;TD&gt;Separate deployment from execution&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;Azure transient failures&lt;/TD&gt;&lt;TD&gt;Retry, or contact Databricks support if frequent&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;</description>
      <pubDate>Mon, 03 Nov 2025 17:20:05 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dlt-communication-lost-with-driver-cluster-was-not-reachable-for/m-p/137435#M50742</guid>
      <dc:creator>nayan_wylde</dc:creator>
      <dc:date>2025-11-03T17:20:05Z</dc:date>
    </item>
  </channel>
</rss>

