<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Topic: Could not reach driver of cluster in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/could-not-reach-driver-of-cluster/m-p/62164#M31911</link>
    <description>Forum thread in Data Engineering on the Databricks Community: &quot;Could not reach driver of cluster&quot; errors in a Structured Streaming job after migrating to Unity Catalog.</description>
    <pubDate>Wed, 28 Feb 2024 00:09:35 GMT</pubDate>
    <dc:creator>Yulei</dc:creator>
    <dc:date>2024-02-28T00:09:35Z</dc:date>
    <item>
      <title>Could not reach driver of cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/could-not-reach-driver-of-cluster/m-p/62164#M31911</link>
      <description>&lt;P&gt;Hi,&amp;nbsp;&lt;/P&gt;&lt;P&gt;Recently, I have been seeing the error&amp;nbsp;&lt;SPAN&gt;&lt;STRONG&gt;Could not reach driver of cluster &amp;lt;some_id&amp;gt;&lt;/STRONG&gt; in my Structured Streaming job while migrating to Unity Catalog, and found this when&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;checking the traceback:&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;Traceback (most recent call last):&lt;BR /&gt;File "/databricks/python_shell/scripts/db_ipykernel_launcher.py", line 142, in &amp;lt;module&amp;gt;&lt;BR /&gt;main()&lt;BR /&gt;File "/databricks/python_shell/scripts/db_ipykernel_launcher.py", line 129, in main&lt;BR /&gt;registerPythonPathHook(entry_point, sc)&lt;BR /&gt;File "/databricks/python_shell/dbruntime/pythonPathHook.py", line 214, in registerPythonPathHook&lt;BR /&gt;entry_point.setPythonPathHook(pathHook)&lt;BR /&gt;File "/databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__&lt;BR /&gt;return_value = get_return_value(&lt;BR /&gt;File "/databricks/spark/python/pyspark/errors/exceptions/captured.py", line 224, in deco&lt;BR /&gt;return f(*a, **kw)&lt;BR /&gt;File "/databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value&lt;BR /&gt;raise Py4JJavaError(&lt;BR /&gt;py4j.protocol.Py4JJavaError: An error occurred while calling t.setPythonPathHook.&lt;BR /&gt;: java.lang.IllegalStateException: Promise already completed.&lt;BR /&gt;at scala.concurrent.Promise.complete(Promise.scala:53)&lt;BR /&gt;at scala.concurrent.Promise.complete$(Promise.scala:52)&lt;BR /&gt;at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:187)&lt;BR /&gt;at scala.concurrent.Promise.success(Promise.scala:86)&lt;BR /&gt;at scala.concurrent.Promise.success$(Promise.scala:86)&lt;BR /&gt;at scala.concurrent.impl.Promise$DefaultPromise.success(Promise.scala:187)&lt;BR /&gt;at com.databricks.backend.daemon.driver.JupyterDriverLocal$JupyterEntryPoint.setPythonPathHook(JupyterDriverLocal.scala:292)&lt;BR /&gt;at sun.reflect.GeneratedMethodAccessor164.invoke(Unknown Source)&lt;BR /&gt;at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)&lt;BR /&gt;at java.lang.reflect.Method.invoke(Method.java:498)&lt;BR /&gt;at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)&lt;BR /&gt;at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:397)&lt;BR /&gt;at py4j.Gateway.invoke(Gateway.java:306)&lt;BR /&gt;at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)&lt;BR /&gt;at py4j.commands.CallCommand.execute(CallCommand.java:79)&lt;BR /&gt;at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:195)&lt;BR /&gt;at py4j.ClientServerConnection.run(ClientServerConnection.java:115)&lt;BR /&gt;at java.lang.Thread.run(Thread.java:750)&lt;/P&gt;&lt;P&gt;There are ~55 tasks in the same job, split across 3 different clusters.&lt;/P&gt;&lt;P&gt;Changes we made before seeing this issue:&lt;BR /&gt;Cluster:&lt;/P&gt;&lt;P&gt;From &lt;STRONG&gt;unrestricted, no isolation&lt;/STRONG&gt; -&amp;gt; &lt;STRONG&gt;unrestricted, single user&lt;/STRONG&gt; (to enable Unity Catalog); it still failed when using &lt;STRONG&gt;single-user job compute&lt;/STRONG&gt;.&lt;BR /&gt;From&amp;nbsp;&lt;SPAN&gt;&lt;STRONG&gt;Databricks Runtime 13.0&lt;/STRONG&gt; -&amp;gt;&amp;nbsp;&lt;STRONG&gt;13.3&lt;/STRONG&gt;; it still failed with the same error on &lt;STRONG&gt;14.0 and 14.3&lt;/STRONG&gt;.&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Each cluster uses &lt;STRONG&gt;i3en.large&lt;/STRONG&gt; with autoscale from &lt;STRONG&gt;1-3&lt;/STRONG&gt; workers, and here is the Spark config:&lt;BR /&gt;&lt;STRONG&gt;spark.executor.heartbeatInterval 10000000&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;&lt;STRONG&gt;spark.driver.maxResultSize 30g&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;&lt;STRONG&gt;spark.network.timeout 10000000&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;&lt;STRONG&gt;spark.sql.parquet.enableVectorizedReader false&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;&lt;STRONG&gt;spark.databricks.delta.preview.enabled true&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;&lt;STRONG&gt;maxRowsInMemory 1000&lt;/STRONG&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;However, we do not see the issue when running a single streaming task in a separate test job, and no issue when running interactively from a notebook on an all-purpose cluster.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;Fortunately, production has a recovery mechanism and recovers on the next retry, but we would still like to know what can be done so streaming can start without hitting the Could not reach driver of cluster issue.&lt;BR /&gt;&lt;BR /&gt;Let me know if you need more information to understand what happened.&lt;/P&gt;</description>
      <pubDate>Wed, 28 Feb 2024 00:09:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/could-not-reach-driver-of-cluster/m-p/62164#M31911</guid>
      <dc:creator>Yulei</dc:creator>
      <dc:date>2024-02-28T00:09:35Z</dc:date>
    </item>
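The Spark settings listed in the post map onto the `spark_conf` block of a Databricks cluster (or job-cluster) JSON spec. Below is a minimal sketch assuming the Clusters API field names (`node_type_id`, `autoscale`, `spark_conf`); the values are copied verbatim from the post, not recommendations, and `maxRowsInMemory` is omitted because it looks like a data-source option rather than a cluster-level Spark conf.

```python
# Sketch: the post's Spark settings as a Databricks cluster spec fragment.
# Field names follow the Clusters API; values are verbatim from the post.
import json

spark_conf = {
    "spark.executor.heartbeatInterval": "10000000",
    "spark.driver.maxResultSize": "30g",
    "spark.network.timeout": "10000000",
    "spark.sql.parquet.enableVectorizedReader": "false",
    "spark.databricks.delta.preview.enabled": "true",
}

cluster_spec = {
    "node_type_id": "i3en.large",                       # instance type from the post
    "autoscale": {"min_workers": 1, "max_workers": 3},  # 1-3 workers
    "spark_conf": spark_conf,
}

# Caution: Spark expects spark.executor.heartbeatInterval to be well below
# spark.network.timeout; the post sets both to the same value (10000000).
print(json.dumps(cluster_spec, indent=2))
```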
    <item>
      <title>Re: Could not reach driver of cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/could-not-reach-driver-of-cluster/m-p/62199#M31919</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/98929"&gt;@Yulei&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Hi,&amp;nbsp;&lt;/P&gt;&lt;P&gt;Rencently, I am seeing issue&amp;nbsp;&lt;SPAN&gt;&lt;STRONG&gt;Could not reach driver of cluster &amp;lt;some_id&amp;gt;&lt;/STRONG&gt; with my structure streaming job when migrating to unity catalog and found this when&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;checking the traceback:&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;Traceback (most recent call last):&lt;BR /&gt;File "/databricks/python_shell/scripts/db_ipykernel_launcher.py", line 142, in &amp;lt;module&amp;gt;&lt;BR /&gt;main()&lt;BR /&gt;File "/databricks/python_shell/scripts/db_ipykernel_launcher.py", line 129, in main&lt;BR /&gt;registerPythonPathHook(entry_point, sc)&lt;BR /&gt;File "/databricks/python_shell/dbruntime/pythonPathHook.py", line 214, in registerPythonPathHook&lt;BR /&gt;entry_point.setPythonPathHook(pathHook)&lt;BR /&gt;File "/databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__&lt;BR /&gt;return_value = get_return_value(&lt;BR /&gt;File "/databricks/spark/python/pyspark/errors/exceptions/captured.py", line 224, in deco&lt;BR /&gt;return f(*a, **kw)&lt;BR /&gt;File "/databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value&lt;BR /&gt;raise Py4JJavaError(&lt;BR /&gt;py4j.protocol.Py4JJavaError: An error occurred while calling t.setPythonPathHook.&lt;BR /&gt;: java.lang.IllegalStateException: Promise already completed.&lt;BR /&gt;at scala.concurrent.Promise.complete(Promise.scala:53)&lt;BR /&gt;at scala.concurrent.Promise.complete$(Promise.scala:52)&lt;BR /&gt;at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:187)&lt;BR /&gt;at scala.concurrent.Promise.success(Promise.scala:86)&lt;BR /&gt;at scala.concurrent.Promise.success$(Promise.scala:86)&lt;BR /&gt;at 
scala.concurrent.impl.Promise$DefaultPromise.success(Promise.scala:187)&lt;BR /&gt;at com.databricks.backend.daemon.driver.JupyterDriverLocal$JupyterEntryPoint.setPythonPathHook(JupyterDriverLocal.scala:292)&lt;BR /&gt;at sun.reflect.GeneratedMethodAccessor164.invoke(Unknown Source)&lt;BR /&gt;at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)&lt;BR /&gt;at java.lang.reflect.Method.invoke(Method.java:498)&lt;BR /&gt;at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)&lt;BR /&gt;at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:397)&lt;BR /&gt;at py4j.Gateway.invoke(Gateway.java:306)&lt;BR /&gt;at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)&lt;BR /&gt;at py4j.commands.CallCommand.execute(CallCommand.java:79)&lt;BR /&gt;at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:195)&lt;BR /&gt;at py4j.ClientServerConnection.run(ClientServerConnection.java:115)&lt;BR /&gt;at java.lang.Thread.run(Thread.java:750)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;There are ~55 tasks in the same job that split to use 3 different clusters&lt;/P&gt;&lt;P&gt;Change we have done before seeing this issue:&lt;BR /&gt;Cluster:&lt;/P&gt;&lt;P&gt;From &lt;STRONG&gt;unrestricted no isolation&lt;/STRONG&gt; -&amp;gt; &lt;STRONG&gt;unrestricted single user&lt;/STRONG&gt; (to enable Unity Catalog) and still failed with using job &lt;STRONG&gt;compute single user&lt;/STRONG&gt;&lt;BR /&gt;From&amp;nbsp;&amp;nbsp;&lt;SPAN&gt;&lt;STRONG&gt;13.0&amp;nbsp; Databricks version&lt;/STRONG&gt; -&amp;gt;&amp;nbsp;&lt;STRONG&gt;13.3 Databricks versions&lt;/STRONG&gt;, and still failed with same error with &lt;STRONG&gt;14.0 and 14.3&lt;/STRONG&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Each cluster use &lt;STRONG&gt;i3en,large&lt;/STRONG&gt; with&amp;nbsp; autoscale from &lt;STRONG&gt;1-3&lt;/STRONG&gt; worker and here is the spark config:&lt;BR 
/&gt;&lt;STRONG&gt;spark.executor.heartbeatInterval 10000000&lt;/STRONG&gt; &lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;&lt;STRONG&gt;spark.driver.maxResultSize 30g&lt;/STRONG&gt; &lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;&lt;STRONG&gt;spark.network.timeout 10000000&lt;/STRONG&gt; &lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;&lt;STRONG&gt;spark.sql.parquet.enableVectorizedReader false&lt;/STRONG&gt; &lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;&lt;STRONG&gt;spark.databricks.delta.preview.enabled true&lt;/STRONG&gt; &lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;&lt;STRONG&gt;maxRowsInMemory 1000&lt;/STRONG&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;However, we are not seeing issue when running a single streaming task when created with seperate job to test and no issue when running with all purpose cluster from notebook interactively.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;Fortunately, production has recover mechanism and recover in the next retry, and we still want to know what can be done so the streaming can be started without seeing cannot reach driver of cluster issue.&lt;BR /&gt;&lt;BR /&gt;Let me know if need more information on understand what happened?&lt;/P&gt;&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;It seems that the problem is related to the setPythonPathHook method, which is used to set the Python path for the driver and the workers. This method returns a Promise object, which can only be completed once.&lt;BR /&gt;&lt;BR /&gt;However, in your case, it seems that the Promise object was already completed before the method was called, resulting in an IllegalStateException.&lt;/P&gt;&lt;P&gt;There are a few possible causes for this issue, such as:&lt;/P&gt;&lt;P&gt;A network connectivity issue between the driver and the workers, which could prevent the Promise object from being properly communicated or updated. 
This could also explain why you are seeing the “Could not reach driver of cluster” error.&lt;BR /&gt;&lt;BR /&gt;You can check the network settings of your clusters and make sure they are not blocking any ports or protocols required by Databricks. You can also try to use a different network or region if possible.&lt;BR /&gt;&lt;BR /&gt;A concurrency issue, where multiple threads or processes are trying to access or modify the same Promise object. This could happen if you are running multiple jobs or notebooks on the same cluster, or if you are using any parallelization or multiprocessing libraries in your code.&lt;BR /&gt;&lt;BR /&gt;You can try to isolate your job or notebook from other workloads, or use synchronization mechanisms to avoid race conditions.&lt;BR /&gt;&lt;BR /&gt;A memory issue, where the driver or the workers are running out of memory and causing the Promise object to be corrupted or lost. This could happen if you are processing large amounts of data or using memory-intensive libraries or operations. You can try to increase the memory allocation for your clusters, or optimize your code to reduce memory usage.&lt;BR /&gt;&lt;BR /&gt;To troubleshoot this issue further, you can also look at the logs of your clusters and your jobs, and see if there are any other errors or warnings that could indicate the root cause.&lt;BR /&gt;&lt;BR /&gt;I hope this helps you fix the issue.&lt;BR /&gt;&lt;BR /&gt;Best regards,&amp;nbsp;&lt;BR /&gt;&lt;SPAN&gt;Latonya86Dodson&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 28 Feb 2024 09:28:28 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/could-not-reach-driver-of-cluster/m-p/62199#M31919</guid>
      <dc:creator>Latonya86Dodson</dc:creator>
      <dc:date>2024-02-28T09:28:28Z</dc:date>
    </item>
    <item>
      <title>Re: Could not reach driver of cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/could-not-reach-driver-of-cluster/m-p/62286#M31942</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/101088"&gt;@Latonya86Dodson&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/98929"&gt;@Yulei&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Hi,&amp;nbsp;&lt;/P&gt;&lt;P&gt;Rencently, I am seeing issue&amp;nbsp;&lt;SPAN&gt;&lt;STRONG&gt;Could not reach driver of cluster &amp;lt;some_id&amp;gt;&lt;/STRONG&gt; with my structure streaming job when migrating to unity catalog and found this when&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;checking the traceback:&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;Traceback (most recent call last):&lt;BR /&gt;File "/databricks/python_shell/scripts/db_ipykernel_launcher.py", line 142, in &amp;lt;module&amp;gt;&lt;BR /&gt;main()&lt;BR /&gt;File "/databricks/python_shell/scripts/db_ipykernel_launcher.py", line 129, in main&lt;BR /&gt;registerPythonPathHook(entry_point, sc)&lt;BR /&gt;File "/databricks/python_shell/dbruntime/pythonPathHook.py", line 214, in registerPythonPathHook&lt;BR /&gt;entry_point.setPythonPathHook(pathHook)&lt;BR /&gt;File "/databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__&lt;BR /&gt;return_value = get_return_value(&lt;BR /&gt;File "/databricks/spark/python/pyspark/errors/exceptions/captured.py", line 224, in deco&lt;BR /&gt;return f(*a, **kw)&lt;BR /&gt;File "/databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value&lt;BR /&gt;raise Py4JJavaError(&lt;BR /&gt;py4j.protocol.Py4JJavaError: An error occurred while calling t.setPythonPathHook.&lt;BR /&gt;: java.lang.IllegalStateException: Promise already completed.&lt;BR /&gt;at scala.concurrent.Promise.complete(Promise.scala:53)&lt;BR /&gt;at scala.concurrent.Promise.complete$(Promise.scala:52)&lt;BR /&gt;at 
scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:187)&lt;BR /&gt;at scala.concurrent.Promise.success(Promise.scala:86)&lt;BR /&gt;at scala.concurrent.Promise.success$(Promise.scala:86)&lt;BR /&gt;at scala.concurrent.impl.Promise$DefaultPromise.success(Promise.scala:187)&lt;BR /&gt;at com.databricks.backend.daemon.driver.JupyterDriverLocal$JupyterEntryPoint.setPythonPathHook(JupyterDriverLocal.scala:292)&lt;BR /&gt;at sun.reflect.GeneratedMethodAccessor164.invoke(Unknown Source)&lt;BR /&gt;at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)&lt;BR /&gt;at java.lang.reflect.Method.invoke(Method.java:498)&lt;BR /&gt;at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)&lt;BR /&gt;at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:397)&lt;BR /&gt;at py4j.Gateway.invoke(Gateway.java:306)&lt;BR /&gt;at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)&lt;BR /&gt;at py4j.commands.CallCommand.execute(CallCommand.java:79)&lt;BR /&gt;at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:195)&lt;BR /&gt;at py4j.ClientServerConnection.run(ClientServerConnection.java:115)&lt;BR /&gt;at java.lang.Thread.run(Thread.java:750)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;There are ~55 tasks in the same job that split to use 3 different clusters&lt;/P&gt;&lt;P&gt;Change we have done before seeing this issue:&lt;BR /&gt;Cluster:&lt;/P&gt;&lt;P&gt;From &lt;STRONG&gt;unrestricted no isolation&lt;/STRONG&gt; -&amp;gt; &lt;STRONG&gt;unrestricted single user&lt;/STRONG&gt; (to enable Unity Catalog) and still failed with using job &lt;STRONG&gt;compute single user&lt;/STRONG&gt;&lt;BR /&gt;From&amp;nbsp;&amp;nbsp;&lt;SPAN&gt;&lt;STRONG&gt;13.0&amp;nbsp; Databricks version&lt;/STRONG&gt; -&amp;gt;&amp;nbsp;&lt;STRONG&gt;13.3 Databricks versions&lt;/STRONG&gt;, and still failed with same error with &lt;STRONG&gt;14.0 and 14.3&lt;/STRONG&gt;&lt;BR 
/&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Each cluster use &lt;STRONG&gt;i3en,large&lt;/STRONG&gt; with&amp;nbsp; autoscale from &lt;STRONG&gt;1-3&lt;/STRONG&gt; worker and here is the spark config:&lt;BR /&gt;&lt;U&gt;&lt;STRONG&gt;spark.executor.heartbeatInterval 10000000&lt;/STRONG&gt; &lt;/U&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;&lt;STRONG&gt;&lt;U&gt;spark.driver.maxResultSize&lt;/U&gt; 30g&lt;/STRONG&gt; &lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;U&gt;&lt;STRONG&gt;spark.network.timeout 10000000&lt;/STRONG&gt; &lt;/U&gt;&lt;/P&gt;&lt;P&gt;&lt;U&gt;&lt;STRONG&gt;spark.sql.parquet.enableVectorizedReader false&lt;/STRONG&gt; &lt;/U&gt;&lt;/P&gt;&lt;P&gt;&lt;U&gt;&lt;STRONG&gt;spark.databricks.delta.preview.enabled true&lt;/STRONG&gt; &lt;/U&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;&lt;U&gt;&lt;STRONG&gt;maxRowsInMemory 1000&lt;/STRONG&gt;&lt;/U&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;However, we are not seeing issue when running a single streaming task when created with seperate job to test and no issue when running with all purpose cluster from notebook interactively.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;Fortunately, production has recover mechanism and recover in the next retry, and we still want to know what can be done so the streaming can be started without seeing cannot reach driver of cluster issue.&lt;BR /&gt;&lt;BR /&gt;Let me know if need more information on understand what happened?&lt;/P&gt;&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;It seems that the problem is related to the setPythonPathHook method, which is used to set the Python path for the driver and the workers. 
This method returns a Promise object, which can only be completed once.&lt;BR /&gt;&lt;BR /&gt;However, in your case, it seems that the Promise object was already completed before the method was called, resulting in an IllegalStateException.&lt;/P&gt;&lt;P&gt;There are a few possible causes for this issue, such as:&lt;/P&gt;&lt;P&gt;A network connectivity issue between the driver and the workers, which could prevent the Promise object from being properly communicated or updated. This could also explain why you are seeing the “Could not reach driver of cluster” error.&lt;BR /&gt;&lt;BR /&gt;You can check the network settings of your clusters and make sure they are not blocking any ports or protocols required by Databricks. You can also try to use a different network or region if possible.&lt;BR /&gt;&lt;BR /&gt;A concurrency issue, where multiple threads or processes are trying to access or modify the same Promise object. This could happen if you are running multiple jobs or notebooks on the same cluster, or if you are using any parallelization or multiprocessing libraries in your code.&lt;BR /&gt;&lt;BR /&gt;You can try to isolate your job or notebook from other workloads, or use synchronization mechanisms to avoid race conditions.&lt;BR /&gt;&lt;BR /&gt;A memory issue, where the driver or the workers are running out of memory and causing the Promise object to be corrupted or lost. This could happen if you are processing large amounts of data or using memory-intensive libraries or operations. You can try to increase the memory allocation for your clusters, or optimize your code to reduce memory usage.&lt;BR /&gt;&lt;BR /&gt;To troubleshoot this issue further, you can also look at the logs of your clusters and your jobs, and see if there are any other errors or warnings that could indicate the root cause.&lt;BR /&gt;&lt;BR /&gt;I hope this helps you to fix this issue. 
&lt;BR /&gt;&lt;BR /&gt;Best regards,&amp;nbsp;&lt;BR /&gt;&lt;SPAN&gt;Latonya86Dodson&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;Hi,&amp;nbsp;&lt;/P&gt;&lt;P&gt;Was this information helpful? If it did not resolve the issue, let me know and I will help further.&lt;/P&gt;</description>
      <pubDate>Thu, 29 Feb 2024 04:20:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/could-not-reach-driver-of-cluster/m-p/62286#M31942</guid>
      <dc:creator>Latonya86Dodson</dc:creator>
      <dc:date>2024-02-29T04:20:16Z</dc:date>
    </item>
    <item>
      <title>Re: Could not reach driver of cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/could-not-reach-driver-of-cluster/m-p/62522#M31984</link>
      <description>&lt;P&gt;Hi, thanks for the reply. Apologies; I am going through each of the suggestions to understand the fix, and also trying to understand why this issue did not show up before I made the change to a single-user cluster. I will provide more updates once I have gone through each of them.&lt;/P&gt;</description>
      <pubDate>Sun, 03 Mar 2024 23:11:55 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/could-not-reach-driver-of-cluster/m-p/62522#M31984</guid>
      <dc:creator>Yulei</dc:creator>
      <dc:date>2024-03-03T23:11:55Z</dc:date>
    </item>
    <item>
      <title>Re: Could not reach driver of cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/could-not-reach-driver-of-cluster/m-p/62679#M32041</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/101088"&gt;@Latonya86Dodson&lt;/a&gt;, thank you for the reply. I ran a test, and it seems that doubling the memory of the driver and switching to an instance type with more memory resolves this issue. However, I still wonder why this happens after I swapped the job to personal compute. Why was I not seeing this when using the unrestricted, no-isolation cluster?&lt;/P&gt;</description>
      <pubDate>Tue, 05 Mar 2024 23:06:58 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/could-not-reach-driver-of-cluster/m-p/62679#M32041</guid>
      <dc:creator>Yulei</dc:creator>
      <dc:date>2024-03-05T23:06:58Z</dc:date>
    </item>
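Yulei's fix (doubling driver memory by moving to a bigger instance) can be expressed with the Clusters API field `driver_node_type_id`, which lets the driver run on a different instance type than the workers. A small sketch; the `i3en.xlarge` choice is an illustrative assumption, not taken from the thread:

```python
# Sketch: upsize only the driver of a cluster spec, leaving workers as-is.
# Instance types below are illustrative, not from the thread.
base = {
    "node_type_id": "i3en.large",         # workers, as in the post
    "driver_node_type_id": "i3en.large",  # driver defaults to the worker type
}

def upsize_driver(spec: dict, driver_type: str) -> dict:
    """Return a copy of the cluster spec with a bigger driver node type."""
    out = dict(spec)
    out["driver_node_type_id"] = driver_type
    return out

# i3en.xlarge carries roughly double the memory of i3en.large.
fixed = upsize_driver(base, "i3en.xlarge")
print(fixed["driver_node_type_id"])
```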
    <item>
      <title>Re: Could not reach driver of cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/could-not-reach-driver-of-cluster/m-p/76501#M35230</link>
      <description>&lt;P&gt;To expand on the same error "&lt;STRONG&gt;Could not reach driver of cluster XX&lt;/STRONG&gt;" but with a different cause:&lt;/P&gt;&lt;P&gt;the reason in my case (an ADF-triggered Databricks job that runs into this error) was a problem with the &lt;STRONG&gt;numpy&lt;/STRONG&gt; library version, and the solution is to downgrade the library on the cluster before the run, e.g. &lt;STRONG&gt;pip install "numpy&amp;lt;2"&lt;/STRONG&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 02 Jul 2024 12:13:30 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/could-not-reach-driver-of-cluster/m-p/76501#M35230</guid>
      <dc:creator>Kub4S</dc:creator>
      <dc:date>2024-07-02T12:13:30Z</dc:date>
    </item>
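Kub4S's workaround pins numpy below major version 2 before the job runs. A small sketch of the corresponding version guard; the helper names are mine, and on a real cluster you would pass `numpy.__version__` and, if the guard fires, run the downgrade (e.g. `pip install "numpy<2"`) before starting the job:

```python
# Sketch: decide whether an installed numpy would need the downgrade
# described in the reply. Generic helpers; on a cluster you would pass
# numpy.__version__ instead of the literal strings below.
def major_version(version: str) -> int:
    """Major component of a dotted version string, e.g. '1.26.4' -> 1."""
    return int(version.split(".")[0])

def needs_numpy_downgrade(version: str) -> bool:
    # The reply pins below 2, i.e. any 2.x install is the problem case.
    return major_version(version) >= 2

print(needs_numpy_downgrade("2.0.1"))   # True: a 2.x install needs the pin
print(needs_numpy_downgrade("1.26.4"))  # False: a 1.x install is fine
```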
    <item>
      <title>Re: Could not reach driver of cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/could-not-reach-driver-of-cluster/m-p/81554#M36341</link>
      <description>&lt;P&gt;Kub4S,&lt;/P&gt;&lt;P&gt;How do you &lt;SPAN&gt;downgrade the library on the cluster before the run?&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 01 Aug 2024 19:27:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/could-not-reach-driver-of-cluster/m-p/81554#M36341</guid>
      <dc:creator>copper-carrot</dc:creator>
      <dc:date>2024-08-01T19:27:29Z</dc:date>
    </item>
    <item>
      <title>Re: Could not reach driver of cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/could-not-reach-driver-of-cluster/m-p/130614#M48853</link>
      <description>&lt;P&gt;It seems like a temporary connectivity or cluster initialization glitch. So if anyone else runs into this, try re-running the job before diving into deeper troubleshooting - it might just work!&lt;/P&gt;&lt;P&gt;Hope this helps someone save time.&lt;/P&gt;</description>
      <pubDate>Wed, 03 Sep 2025 07:40:38 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/could-not-reach-driver-of-cluster/m-p/130614#M48853</guid>
      <dc:creator>osingh</dc:creator>
      <dc:date>2025-09-03T07:40:38Z</dc:date>
    </item>
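osingh's advice to simply re-run the job matches what Yulei reported about production recovering on the next retry. A generic sketch of bounded retries around a job launch; the function names and the toy failure below are illustrative, not from the thread:

```python
# Sketch: retry a flaky job launch a bounded number of times before
# escalating to deeper troubleshooting. `launch` is any callable that
# raises on failure and returns normally on success.
def run_with_retries(launch, max_attempts: int = 3):
    last_err = None
    for _attempt in range(max_attempts):
        try:
            return launch()
        except RuntimeError as err:  # e.g. "Could not reach driver of cluster"
            last_err = err
    raise last_err

# Toy usage: fails once, then succeeds on the retry.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 2:
        raise RuntimeError("Could not reach driver of cluster <some_id>")
    return "started"

print(run_with_retries(flaky))  # prints "started" on the second attempt
```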
  </channel>
</rss>

