<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Azure Databricks Job Run Failed with Error - Could not reach driver of cluster in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/azure-databricks-job-run-failed-with-error-could-not-reach/m-p/132227#M49389</link>
    <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/184970"&gt;@sandeepsuresh16&lt;/a&gt;&amp;nbsp;,&amp;nbsp;&lt;/P&gt;
&lt;P&gt;You can follow the recommendations and also check the KB articles shared by&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/110502"&gt;@szymon_dybczak&lt;/a&gt;. I think those should help you.&lt;/P&gt;</description>
    <pubDate>Wed, 17 Sep 2025 11:20:29 GMT</pubDate>
    <dc:creator>K_Anudeep</dc:creator>
    <dc:date>2025-09-17T11:20:29Z</dc:date>
    <item>
      <title>Azure Databricks Job Run Failed with Error - Could not reach driver of cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/azure-databricks-job-run-failed-with-error-could-not-reach/m-p/132123#M49361</link>
      <description>&lt;P&gt;Hello Community,&lt;/P&gt;&lt;P&gt;I am facing an intermittent issue while running a Databricks job. The job fails with the following error message:&lt;/P&gt;&lt;P&gt;Run failed with error message:&lt;BR /&gt;Could not reach driver of cluster &amp;lt;cluster-id&amp;gt;.&lt;/P&gt;&lt;P&gt;Here are some additional details:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Cluster Type: Job cluster&lt;/LI&gt;&lt;LI&gt;Cluster Size: Standard_F8&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Job Setup: This job runs a standard ETL notebook.&lt;/P&gt;&lt;P&gt;Behavior:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;The error is not consistent; when retried, the job sometimes succeeds.&lt;/LI&gt;&lt;LI&gt;No recent changes were made to the job code or cluster configuration.&lt;/LI&gt;&lt;LI&gt;There were no schema changes in the tables involved.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Questions for the community:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;What are the possible reasons for getting a "Could not reach driver of cluster" error?&lt;/LI&gt;&lt;LI&gt;Is this usually caused by transient network issues, cluster instability, or driver overload?&lt;/LI&gt;&lt;LI&gt;Are there any recommended best practices or cluster configurations to prevent such driver reachability failures?&lt;/LI&gt;&lt;LI&gt;Should I look for specific logs or metrics in the driver logs to narrow down the root cause?&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Any guidance or troubleshooting tips would be highly appreciated.&lt;/P&gt;&lt;P&gt;Note: I have attached the cluster log for reference.&lt;/P&gt;</description>
      <pubDate>Tue, 16 Sep 2025 14:22:27 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/azure-databricks-job-run-failed-with-error-could-not-reach/m-p/132123#M49361</guid>
      <dc:creator>sandeepsuresh16</dc:creator>
      <dc:date>2025-09-16T14:22:27Z</dc:date>
    </item>
    <item>
      <title>Re: Azure Databricks Job Run Failed with Error - Could not reach driver of cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/azure-databricks-job-run-failed-with-error-could-not-reach/m-p/132144#M49369</link>
      <description>&lt;P&gt;Hello&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/184970"&gt;@sandeepsuresh16&lt;/a&gt;&amp;nbsp;,&lt;BR /&gt;&lt;SPAN&gt;Below are the answers to your questions:&lt;/SPAN&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;What are the possible reasons for getting a "Could not reach driver of cluster" error?&lt;/STRONG&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;The error "Could not reach driver of cluster &amp;lt;cluster-id&amp;gt;" can have several different causes. Use the following troubleshooting steps to check whether any of them applies to your job:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Check whether the job runs multiple tasks concurrently, which can increase the load on the driver.&lt;/LI&gt;&lt;LI&gt;Check whether the driver’s CPU and memory utilization were unusually high (approaching or at 100%) at the time of the failure. In the PDF you attached I see a lot of continuous Full GC entries, which indicates that the driver is under memory pressure and struggling to clean up objects. That might be the cause of this issue.&lt;/LI&gt;&lt;LI&gt;Look for the following error trace in the driver logs. 
This error indicates a REPL (Read-Eval-Print Loop) startup failure due to a timeout, often caused by too many REPLs being created simultaneously.&lt;BR /&gt;&lt;STRONG&gt;Failed to start repl ReplId-&amp;lt;id&amp;gt; com.databricks.backend.daemon.driver.PythonDriverLocal$PythonException:&amp;nbsp; Unable to start python kernel for ReplId-&amp;lt;id&amp;gt;, kernel did not start within 80 seconds.&lt;/STRONG&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Is this usually caused by transient network issues, cluster instability, or driver overload?&lt;/STRONG&gt; See the causes listed above, all of which relate to load on the driver.&lt;BR /&gt;&lt;BR /&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Are there any recommended best practices or cluster configurations to prevent such driver reachability failures?&lt;/STRONG&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;OL&gt;&lt;LI&gt;&lt;P&gt;Move from &lt;STRONG&gt;F-series&lt;/STRONG&gt; (compute-optimized) to a &lt;STRONG&gt;memory-optimized&lt;/STRONG&gt; driver (e.g., E-series), a general-purpose D-series, or at least a larger F node. Increase the driver memory by choosing a larger node type, not only by setting spark.driver.memory in the Spark config.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Reduce collect()/toPandas() calls and any driver-side loops or UDF work.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;If you launch many notebooks/tasks at once, &lt;STRONG&gt;raise the REPL launch timeout&lt;/STRONG&gt; (Jobs→Compute→Spark config):&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;&lt;EM&gt;spark.databricks.driver.ipykernel.launchTimeoutSeconds 300&lt;/EM&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Should I look for specific logs or metrics in the driver logs to narrow down the root cause?&lt;/STRONG&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;UL&gt;&lt;LI&gt;Cluster events (same timestamp as the failed run): look for DRIVER_NOT_RESPONDING / “Driver is up but not responsive, likely due to GC”.&lt;/LI&gt;&lt;LI&gt;You can also check the driver's memory utilisation on the cluster's Metrics tab.&lt;/LI&gt;&lt;LI&gt;Also check whether the driver is under memory pressure by looking for frequent Full GC pauses in the GC logs.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Please do let me know if you have any further questions.&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;</description>
      <pubDate>Tue, 16 Sep 2025 17:27:19 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/azure-databricks-job-run-failed-with-error-could-not-reach/m-p/132144#M49369</guid>
      <dc:creator>K_Anudeep</dc:creator>
      <dc:date>2025-09-16T17:27:19Z</dc:date>
    </item>
    <item>
      <title>Re: Azure Databricks Job Run Failed with Error - Could not reach driver of cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/azure-databricks-job-run-failed-with-error-could-not-reach/m-p/132218#M49387</link>
      <description>&lt;P&gt;Hello Anudeep,&lt;/P&gt;&lt;P&gt;Thank you for your detailed response and the helpful recommendations.&lt;/P&gt;&lt;P&gt;I would like to provide some additional context:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;For our jobs, we are running only one notebook at a time, not multiple notebooks or tasks concurrently.&lt;/LI&gt;&lt;LI&gt;The issue occurs before the notebook execution starts — we do not see any cell execution in the logs.&lt;/LI&gt;&lt;LI&gt;We also could not find any error message like "Unable to start python kernel for ReplId-&amp;lt;id&amp;gt;, kernel did not start within 80 seconds."&lt;/LI&gt;&lt;LI&gt;This error happens right at the start of a job run using a job cluster, before any code in the notebook is executed.&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;Regarding your suggestion about changing to a memory-optimized driver series, thank you for the recommendation — we will definitely consider this option.&lt;/P&gt;&lt;P&gt;Please let me know if there are any additional logs or metrics you would recommend checking in this specific scenario.&lt;/P&gt;&lt;P&gt;Thanks &amp;amp; Regards,&lt;BR /&gt;Sandeep&lt;/P&gt;</description>
      <pubDate>Wed, 17 Sep 2025 09:26:51 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/azure-databricks-job-run-failed-with-error-could-not-reach/m-p/132218#M49387</guid>
      <dc:creator>sandeepsuresh16</dc:creator>
      <dc:date>2025-09-17T09:26:51Z</dc:date>
    </item>
    <item>
      <title>Re: Azure Databricks Job Run Failed with Error - Could not reach driver of cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/azure-databricks-job-run-failed-with-error-could-not-reach/m-p/132220#M49388</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/184970"&gt;@sandeepsuresh16&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;Check the two articles below. One of them suggests metrics to check, and you will also find suggestions there on how to limit the occurrence of this problem.&lt;/P&gt;&lt;P&gt;&lt;A href="https://kb.databricks.com/clusters/workflows-are-failing-with-a-could-not-reach-driver-of-the-cluster-error?from_search=201357923" target="_blank" rel="noopener"&gt;Workflows are failing with a 'Could not reach driver of the cluster' error - Databricks&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://kb.databricks.com/clusters/job-run-fails-with-error-message-could-not-reach-driver-of-cluster?from_search=201357923" target="_blank" rel="noopener"&gt;Job run fails with error message “Could not reach driver of cluster” - Databricks&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="szymon_dybczak_0-1758101467930.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/20035i30ACB34BD74D0ED6/image-size/medium?v=v2&amp;amp;px=400" role="button" title="szymon_dybczak_0-1758101467930.png" alt="szymon_dybczak_0-1758101467930.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 17 Sep 2025 09:33:21 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/azure-databricks-job-run-failed-with-error-could-not-reach/m-p/132220#M49388</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2025-09-17T09:33:21Z</dc:date>
    </item>
    <item>
      <title>Re: Azure Databricks Job Run Failed with Error - Could not reach driver of cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/azure-databricks-job-run-failed-with-error-could-not-reach/m-p/132227#M49389</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/184970"&gt;@sandeepsuresh16&lt;/a&gt;&amp;nbsp;,&amp;nbsp;&lt;/P&gt;
&lt;P&gt;You can follow the recommendations and also check the KB articles shared by&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/110502"&gt;@szymon_dybczak&lt;/a&gt;. I think those should help you.&lt;/P&gt;</description>
      <pubDate>Wed, 17 Sep 2025 11:20:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/azure-databricks-job-run-failed-with-error-could-not-reach/m-p/132227#M49389</guid>
      <dc:creator>K_Anudeep</dc:creator>
      <dc:date>2025-09-17T11:20:29Z</dc:date>
    </item>
  </channel>
</rss>

