<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Topic: Could not reach driver of cluster in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/could-not-reach-driver-of-cluster/m-p/62164#M31911</link>
    <description>Forum thread in Data Engineering on the Databricks Community: &quot;Could not reach driver of cluster&quot; errors in a Structured Streaming job after migrating to Unity Catalog.</description>
    <pubDate>Wed, 28 Feb 2024 00:09:35 GMT</pubDate>
    <dc:creator>Yulei</dc:creator>
    <dc:date>2024-02-28T00:09:35Z</dc:date>
    <item>
      <title>Could not reach driver of cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/could-not-reach-driver-of-cluster/m-p/62164#M31911</link>
      <description>&lt;P&gt;Hi,&amp;nbsp;&lt;/P&gt;&lt;P&gt;Recently, I have been seeing the error&amp;nbsp;&lt;SPAN&gt;&lt;STRONG&gt;Could not reach driver of cluster &amp;lt;some_id&amp;gt;&lt;/STRONG&gt; in my Structured Streaming job while migrating to Unity Catalog, and found this when&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;checking the traceback:&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;Traceback (most recent call last):&lt;BR /&gt;File "/databricks/python_shell/scripts/db_ipykernel_launcher.py", line 142, in &amp;lt;module&amp;gt;&lt;BR /&gt;main()&lt;BR /&gt;File "/databricks/python_shell/scripts/db_ipykernel_launcher.py", line 129, in main&lt;BR /&gt;registerPythonPathHook(entry_point, sc)&lt;BR /&gt;File "/databricks/python_shell/dbruntime/pythonPathHook.py", line 214, in registerPythonPathHook&lt;BR /&gt;entry_point.setPythonPathHook(pathHook)&lt;BR /&gt;File "/databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__&lt;BR /&gt;return_value = get_return_value(&lt;BR /&gt;File "/databricks/spark/python/pyspark/errors/exceptions/captured.py", line 224, in deco&lt;BR /&gt;return f(*a, **kw)&lt;BR /&gt;File "/databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value&lt;BR /&gt;raise Py4JJavaError(&lt;BR /&gt;py4j.protocol.Py4JJavaError: An error occurred while calling t.setPythonPathHook.&lt;BR /&gt;: java.lang.IllegalStateException: Promise already completed.&lt;BR /&gt;at scala.concurrent.Promise.complete(Promise.scala:53)&lt;BR /&gt;at scala.concurrent.Promise.complete$(Promise.scala:52)&lt;BR /&gt;at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:187)&lt;BR /&gt;at scala.concurrent.Promise.success(Promise.scala:86)&lt;BR /&gt;at scala.concurrent.Promise.success$(Promise.scala:86)&lt;BR /&gt;at scala.concurrent.impl.Promise$DefaultPromise.success(Promise.scala:187)&lt;BR /&gt;at com.databricks.backend.daemon.driver.JupyterDriverLocal$JupyterEntryPoint.setPythonPathHook(JupyterDriverLocal.scala:292)&lt;BR /&gt;at sun.reflect.GeneratedMethodAccessor164.invoke(Unknown Source)&lt;BR /&gt;at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)&lt;BR /&gt;at java.lang.reflect.Method.invoke(Method.java:498)&lt;BR /&gt;at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)&lt;BR /&gt;at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:397)&lt;BR /&gt;at py4j.Gateway.invoke(Gateway.java:306)&lt;BR /&gt;at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)&lt;BR /&gt;at py4j.commands.CallCommand.execute(CallCommand.java:79)&lt;BR /&gt;at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:195)&lt;BR /&gt;at py4j.ClientServerConnection.run(ClientServerConnection.java:115)&lt;BR /&gt;at java.lang.Thread.run(Thread.java:750)&lt;/P&gt;&lt;P&gt;There are ~55 tasks in the same job, split across 3 different clusters.&lt;/P&gt;&lt;P&gt;Changes we made before seeing this issue:&lt;BR /&gt;Cluster:&lt;/P&gt;&lt;P&gt;From &lt;STRONG&gt;unrestricted, no isolation&lt;/STRONG&gt; -&amp;gt; &lt;STRONG&gt;unrestricted, single user&lt;/STRONG&gt; (to enable Unity Catalog); it still failed when using &lt;STRONG&gt;single-user job compute&lt;/STRONG&gt;.&lt;BR /&gt;From&amp;nbsp;&lt;SPAN&gt;&lt;STRONG&gt;Databricks Runtime 13.0&lt;/STRONG&gt; -&amp;gt;&amp;nbsp;&lt;STRONG&gt;13.3&lt;/STRONG&gt;; it still failed with the same error on &lt;STRONG&gt;14.0 and 14.3&lt;/STRONG&gt;.&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Each cluster uses &lt;STRONG&gt;i3en.large&lt;/STRONG&gt; with autoscale from &lt;STRONG&gt;1-3&lt;/STRONG&gt; workers, and here is the Spark config:&lt;BR /&gt;&lt;STRONG&gt;spark.executor.heartbeatInterval 10000000&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;&lt;STRONG&gt;spark.driver.maxResultSize 30g&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;&lt;STRONG&gt;spark.network.timeout 10000000&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;&lt;STRONG&gt;spark.sql.parquet.enableVectorizedReader false&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;&lt;STRONG&gt;spark.databricks.delta.preview.enabled true&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;&lt;STRONG&gt;maxRowsInMemory 1000&lt;/STRONG&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;However, we do not see the issue when running a single streaming task in a separate test job, and no issue when running interactively from a notebook on an all-purpose cluster.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;Fortunately, production has a recovery mechanism and recovers on the next retry, but we would still like to know what can be done so streaming can start without hitting the Could not reach driver of cluster issue.&lt;BR /&gt;&lt;BR /&gt;Let me know if you need more information to understand what happened.&lt;/P&gt;</description>
      <pubDate>Wed, 28 Feb 2024 00:09:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/could-not-reach-driver-of-cluster/m-p/62164#M31911</guid>
      <dc:creator>Yulei</dc:creator>
      <dc:date>2024-02-28T00:09:35Z</dc:date>
    </item>
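The Spark settings listed in the post map onto the `spark_conf` block of a Databricks cluster (or job-cluster) JSON spec. Below is a minimal sketch assuming the Clusters API field names (`node_type_id`, `autoscale`, `spark_conf`); the values are copied verbatim from the post, not recommendations, and `maxRowsInMemory` is omitted because it looks like a data-source option rather than a cluster-level Spark conf.

```python
# Sketch: the post's Spark settings as a Databricks cluster spec fragment.
# Field names follow the Clusters API; values are verbatim from the post.
import json

spark_conf = {
    "spark.executor.heartbeatInterval": "10000000",
    "spark.driver.maxResultSize": "30g",
    "spark.network.timeout": "10000000",
    "spark.sql.parquet.enableVectorizedReader": "false",
    "spark.databricks.delta.preview.enabled": "true",
}

cluster_spec = {
    "node_type_id": "i3en.large",                       # instance type from the post
    "autoscale": {"min_workers": 1, "max_workers": 3},  # 1-3 workers
    "spark_conf": spark_conf,
}

# Caution: Spark expects spark.executor.heartbeatInterval to be well below
# spark.network.timeout; the post sets both to the same value (10000000).
print(json.dumps(cluster_spec, indent=2))
```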
    <item>
      <title>Re: Could not reach driver of cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/could-not-reach-driver-of-cluster/m-p/62199#M31919</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/98929"&gt;@Yulei&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Hi,&amp;nbsp;&lt;/P&gt;&lt;P&gt;Rencently, I am seeing issue&amp;nbsp;&lt;SPAN&gt;&lt;STRONG&gt;Could not reach driver of cluster &amp;lt;some_id&amp;gt;&lt;/STRONG&gt; with my structure streaming job when migrating to unity catalog and found this when&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;checking the traceback:&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;Traceback (most recent call last):&lt;BR /&gt;File "/databricks/python_shell/scripts/db_ipykernel_launcher.py", line 142, in &amp;lt;module&amp;gt;&lt;BR /&gt;main()&lt;BR /&gt;File "/databricks/python_shell/scripts/db_ipykernel_launcher.py", line 129, in main&lt;BR /&gt;registerPythonPathHook(entry_point, sc)&lt;BR /&gt;File "/databricks/python_shell/dbruntime/pythonPathHook.py", line 214, in registerPythonPathHook&lt;BR /&gt;entry_point.setPythonPathHook(pathHook)&lt;BR /&gt;File "/databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__&lt;BR /&gt;return_value = get_return_value(&lt;BR /&gt;File "/databricks/spark/python/pyspark/errors/exceptions/captured.py", line 224, in deco&lt;BR /&gt;return f(*a, **kw)&lt;BR /&gt;File "/databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value&lt;BR /&gt;raise Py4JJavaError(&lt;BR /&gt;py4j.protocol.Py4JJavaError: An error occurred while calling t.setPythonPathHook.&lt;BR /&gt;: java.lang.IllegalStateException: Promise already completed.&lt;BR /&gt;at scala.concurrent.Promise.complete(Promise.scala:53)&lt;BR /&gt;at scala.concurrent.Promise.complete$(Promise.scala:52)&lt;BR /&gt;at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:187)&lt;BR /&gt;at scala.concurrent.Promise.success(Promise.scala:86)&lt;BR /&gt;at scala.concurrent.Promise.success$(Promise.scala:86)&lt;BR /&gt;at 
scala.concurrent.impl.Promise$DefaultPromise.success(Promise.scala:187)&lt;BR /&gt;at com.databricks.backend.daemon.driver.JupyterDriverLocal$JupyterEntryPoint.setPythonPathHook(JupyterDriverLocal.scala:292)&lt;BR /&gt;at sun.reflect.GeneratedMethodAccessor164.invoke(Unknown Source)&lt;BR /&gt;at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)&lt;BR /&gt;at java.lang.reflect.Method.invoke(Method.java:498)&lt;BR /&gt;at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)&lt;BR /&gt;at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:397)&lt;BR /&gt;at py4j.Gateway.invoke(Gateway.java:306)&lt;BR /&gt;at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)&lt;BR /&gt;at py4j.commands.CallCommand.execute(CallCommand.java:79)&lt;BR /&gt;at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:195)&lt;BR /&gt;at py4j.ClientServerConnection.run(ClientServerConnection.java:115)&lt;BR /&gt;at java.lang.Thread.run(Thread.java:750)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;There are ~55 tasks in the same job that split to use 3 different clusters&lt;/P&gt;&lt;P&gt;Change we have done before seeing this issue:&lt;BR /&gt;Cluster:&lt;/P&gt;&lt;P&gt;From &lt;STRONG&gt;unrestricted no isolation&lt;/STRONG&gt; -&amp;gt; &lt;STRONG&gt;unrestricted single user&lt;/STRONG&gt; (to enable Unity Catalog) and still failed with using job &lt;STRONG&gt;compute single user&lt;/STRONG&gt;&lt;BR /&gt;From&amp;nbsp;&amp;nbsp;&lt;SPAN&gt;&lt;STRONG&gt;13.0&amp;nbsp; Databricks version&lt;/STRONG&gt; -&amp;gt;&amp;nbsp;&lt;STRONG&gt;13.3 Databricks versions&lt;/STRONG&gt;, and still failed with same error with &lt;STRONG&gt;14.0 and 14.3&lt;/STRONG&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Each cluster use &lt;STRONG&gt;i3en,large&lt;/STRONG&gt; with&amp;nbsp; autoscale from &lt;STRONG&gt;1-3&lt;/STRONG&gt; worker and here is the spark config:&lt;BR 
/&gt;&lt;STRONG&gt;spark.executor.heartbeatInterval 10000000&lt;/STRONG&gt; &lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;&lt;STRONG&gt;spark.driver.maxResultSize 30g&lt;/STRONG&gt; &lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;&lt;STRONG&gt;spark.network.timeout 10000000&lt;/STRONG&gt; &lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;&lt;STRONG&gt;spark.sql.parquet.enableVectorizedReader false&lt;/STRONG&gt; &lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;&lt;STRONG&gt;spark.databricks.delta.preview.enabled true&lt;/STRONG&gt; &lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;&lt;STRONG&gt;maxRowsInMemory 1000&lt;/STRONG&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;However, we are not seeing issue when running a single streaming task when created with seperate job to test and no issue when running with all purpose cluster from notebook interactively.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;Fortunately, production has recover mechanism and recover in the next retry, and we still want to know what can be done so the streaming can be started without seeing cannot reach driver of cluster issue.&lt;BR /&gt;&lt;BR /&gt;Let me know if need more information on understand what happened?&lt;/P&gt;&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;It seems that the problem is related to the setPythonPathHook method, which is used to set the Python path for the driver and the workers. This method returns a Promise object, which can only be completed once.&lt;BR /&gt;&lt;BR /&gt;However, in your case, it seems that the Promise object was already completed before the method was called, resulting in an IllegalStateException.&lt;/P&gt;&lt;P&gt;There are a few possible causes for this issue, such as:&lt;/P&gt;&lt;P&gt;A network connectivity issue between the driver and the workers, which could prevent the Promise object from being properly communicated or updated. 
This could also explain why you are seeing the “Could not reach driver of cluster” error.&lt;BR /&gt;&lt;BR /&gt;You can check the network settings of your clusters and make sure they are not blocking any ports or protocols required by Databricks. You can also try to use a different network or region if possible.&lt;BR /&gt;&lt;BR /&gt;A concurrency issue, where multiple threads or processes are trying to access or modify the same Promise object. This could happen if you are running multiple jobs or notebooks on the same cluster, or if you are using any parallelization or multiprocessing libraries in your code.&lt;BR /&gt;&lt;BR /&gt;You can try to isolate your job or notebook from other workloads, or use synchronization mechanisms to avoid race conditions.&lt;BR /&gt;&lt;BR /&gt;A memory issue, where the driver or the workers are running out of memory and causing the Promise object to be corrupted or lost. This could happen if you are processing large amounts of data or using memory-intensive libraries or operations. You can try to increase the memory allocation for your clusters, or optimize your code to reduce memory usage.&lt;BR /&gt;&lt;BR /&gt;To troubleshoot this issue further, you can also look at the logs of your clusters and your jobs, and see if there are any other errors or warnings that could indicate the root cause.&lt;BR /&gt;&lt;BR /&gt;I hope this helps you fix the issue.&lt;BR /&gt;&lt;BR /&gt;Best regards,&amp;nbsp;&lt;BR /&gt;&lt;SPAN&gt;Latonya86Dodson&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 28 Feb 2024 09:28:28 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/could-not-reach-driver-of-cluster/m-p/62199#M31919</guid>
      <dc:creator>Latonya86Dodson</dc:creator>
      <dc:date>2024-02-28T09:28:28Z</dc:date>
    </item>
    <item>
      <title>Re: Could not reach driver of cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/could-not-reach-driver-of-cluster/m-p/62286#M31942</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/101088"&gt;@Latonya86Dodson&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/98929"&gt;@Yulei&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Hi,&amp;nbsp;&lt;/P&gt;&lt;P&gt;Rencently, I am seeing issue&amp;nbsp;&lt;SPAN&gt;&lt;STRONG&gt;Could not reach driver of cluster &amp;lt;some_id&amp;gt;&lt;/STRONG&gt; with my structure streaming job when migrating to unity catalog and found this when&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;checking the traceback:&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;Traceback (most recent call last):&lt;BR /&gt;File "/databricks/python_shell/scripts/db_ipykernel_launcher.py", line 142, in &amp;lt;module&amp;gt;&lt;BR /&gt;main()&lt;BR /&gt;File "/databricks/python_shell/scripts/db_ipykernel_launcher.py", line 129, in main&lt;BR /&gt;registerPythonPathHook(entry_point, sc)&lt;BR /&gt;File "/databricks/python_shell/dbruntime/pythonPathHook.py", line 214, in registerPythonPathHook&lt;BR /&gt;entry_point.setPythonPathHook(pathHook)&lt;BR /&gt;File "/databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__&lt;BR /&gt;return_value = get_return_value(&lt;BR /&gt;File "/databricks/spark/python/pyspark/errors/exceptions/captured.py", line 224, in deco&lt;BR /&gt;return f(*a, **kw)&lt;BR /&gt;File "/databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value&lt;BR /&gt;raise Py4JJavaError(&lt;BR /&gt;py4j.protocol.Py4JJavaError: An error occurred while calling t.setPythonPathHook.&lt;BR /&gt;: java.lang.IllegalStateException: Promise already completed.&lt;BR /&gt;at scala.concurrent.Promise.complete(Promise.scala:53)&lt;BR /&gt;at scala.concurrent.Promise.complete$(Promise.scala:52)&lt;BR /&gt;at 
scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:187)&lt;BR /&gt;at scala.concurrent.Promise.success(Promise.scala:86)&lt;BR /&gt;at scala.concurrent.Promise.success$(Promise.scala:86)&lt;BR /&gt;at scala.concurrent.impl.Promise$DefaultPromise.success(Promise.scala:187)&lt;BR /&gt;at com.databricks.backend.daemon.driver.JupyterDriverLocal$JupyterEntryPoint.setPythonPathHook(JupyterDriverLocal.scala:292)&lt;BR /&gt;at sun.reflect.GeneratedMethodAccessor164.invoke(Unknown Source)&lt;BR /&gt;at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)&lt;BR /&gt;at java.lang.reflect.Method.invoke(Method.java:498)&lt;BR /&gt;at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)&lt;BR /&gt;at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:397)&lt;BR /&gt;at py4j.Gateway.invoke(Gateway.java:306)&lt;BR /&gt;at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)&lt;BR /&gt;at py4j.commands.CallCommand.execute(CallCommand.java:79)&lt;BR /&gt;at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:195)&lt;BR /&gt;at py4j.ClientServerConnection.run(ClientServerConnection.java:115)&lt;BR /&gt;at java.lang.Thread.run(Thread.java:750)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;There are ~55 tasks in the same job that split to use 3 different clusters&lt;/P&gt;&lt;P&gt;Change we have done before seeing this issue:&lt;BR /&gt;Cluster:&lt;/P&gt;&lt;P&gt;From &lt;STRONG&gt;unrestricted no isolation&lt;/STRONG&gt; -&amp;gt; &lt;STRONG&gt;unrestricted single user&lt;/STRONG&gt; (to enable Unity Catalog) and still failed with using job &lt;STRONG&gt;compute single user&lt;/STRONG&gt;&lt;BR /&gt;From&amp;nbsp;&amp;nbsp;&lt;SPAN&gt;&lt;STRONG&gt;13.0&amp;nbsp; Databricks version&lt;/STRONG&gt; -&amp;gt;&amp;nbsp;&lt;STRONG&gt;13.3 Databricks versions&lt;/STRONG&gt;, and still failed with same error with &lt;STRONG&gt;14.0 and 14.3&lt;/STRONG&gt;&lt;BR 
/&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Each cluster use &lt;STRONG&gt;i3en,large&lt;/STRONG&gt; with&amp;nbsp; autoscale from &lt;STRONG&gt;1-3&lt;/STRONG&gt; worker and here is the spark config:&lt;BR /&gt;&lt;U&gt;&lt;STRONG&gt;spark.executor.heartbeatInterval 10000000&lt;/STRONG&gt; &lt;/U&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;&lt;STRONG&gt;&lt;U&gt;spark.driver.maxResultSize&lt;/U&gt; 30g&lt;/STRONG&gt; &lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;U&gt;&lt;STRONG&gt;spark.network.timeout 10000000&lt;/STRONG&gt; &lt;/U&gt;&lt;/P&gt;&lt;P&gt;&lt;U&gt;&lt;STRONG&gt;spark.sql.parquet.enableVectorizedReader false&lt;/STRONG&gt; &lt;/U&gt;&lt;/P&gt;&lt;P&gt;&lt;U&gt;&lt;STRONG&gt;spark.databricks.delta.preview.enabled true&lt;/STRONG&gt; &lt;/U&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;&lt;U&gt;&lt;STRONG&gt;maxRowsInMemory 1000&lt;/STRONG&gt;&lt;/U&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;However, we are not seeing issue when running a single streaming task when created with seperate job to test and no issue when running with all purpose cluster from notebook interactively.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;Fortunately, production has recover mechanism and recover in the next retry, and we still want to know what can be done so the streaming can be started without seeing cannot reach driver of cluster issue.&lt;BR /&gt;&lt;BR /&gt;Let me know if need more information on understand what happened?&lt;/P&gt;&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;It seems that the problem is related to the setPythonPathHook method, which is used to set the Python path for the driver and the workers. 
This method returns a Promise object, which can only be completed once.&lt;BR /&gt;&lt;BR /&gt;However, in your case, it seems that the Promise object was already completed before the method was called, resulting in an IllegalStateException.&lt;/P&gt;&lt;P&gt;There are a few possible causes for this issue, such as:&lt;/P&gt;&lt;P&gt;A network connectivity issue between the driver and the workers, which could prevent the Promise object from being properly communicated or updated. This could also explain why you are seeing the “Could not reach driver of cluster” error.&lt;BR /&gt;&lt;BR /&gt;You can check the network settings of your clusters and make sure they are not blocking any ports or protocols required by Databricks. You can also try to use a different network or region if possible.&lt;BR /&gt;&lt;BR /&gt;A concurrency issue, where multiple threads or processes are trying to access or modify the same Promise object. This could happen if you are running multiple jobs or notebooks on the same cluster, or if you are using any parallelization or multiprocessing libraries in your code.&lt;BR /&gt;&lt;BR /&gt;You can try to isolate your job or notebook from other workloads, or use synchronization mechanisms to avoid race conditions.&lt;BR /&gt;&lt;BR /&gt;A memory issue, where the driver or the workers are running out of memory and causing the Promise object to be corrupted or lost. This could happen if you are processing large amounts of data or using memory-intensive libraries or operations. You can try to increase the memory allocation for your clusters, or optimize your code to reduce memory usage.&lt;BR /&gt;&lt;BR /&gt;To troubleshoot this issue further, you can also look at the logs of your clusters and your jobs, and see if there are any other errors or warnings that could indicate the root cause.&lt;BR /&gt;&lt;BR /&gt;I hope this helps you to fix this issue. 
&lt;BR /&gt;&lt;BR /&gt;Best regards,&amp;nbsp;&lt;BR /&gt;&lt;SPAN&gt;Latonya86Dodson&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;Hi,&amp;nbsp;&lt;/P&gt;&lt;P&gt;Was this information helpful? If it did not resolve the issue, let me know and I will help further.&lt;/P&gt;</description>
      <pubDate>Thu, 29 Feb 2024 04:20:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/could-not-reach-driver-of-cluster/m-p/62286#M31942</guid>
      <dc:creator>Latonya86Dodson</dc:creator>
      <dc:date>2024-02-29T04:20:16Z</dc:date>
    </item>
    <item>
      <title>Re: Could not reach driver of cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/could-not-reach-driver-of-cluster/m-p/62522#M31984</link>
      <description>&lt;P&gt;Hi, thanks for the reply. Apologies; I am going through each of the suggestions to understand the fix, and also trying to understand why this issue did not show up before I made the change to a single-user cluster. I will provide more updates once I have gone through each of them.&lt;/P&gt;</description>
      <pubDate>Sun, 03 Mar 2024 23:11:55 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/could-not-reach-driver-of-cluster/m-p/62522#M31984</guid>
      <dc:creator>Yulei</dc:creator>
      <dc:date>2024-03-03T23:11:55Z</dc:date>
    </item>
    <item>
      <title>Re: Could not reach driver of cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/could-not-reach-driver-of-cluster/m-p/62679#M32041</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/101088"&gt;@Latonya86Dodson&lt;/a&gt;, thank you for the reply. I ran a test, and it seems that doubling the memory of the driver and switching to an instance type with more memory resolves this issue. However, I still wonder why this happens after I swapped the job to personal compute. Why was I not seeing this when using the unrestricted, no-isolation cluster?&lt;/P&gt;</description>
      <pubDate>Tue, 05 Mar 2024 23:06:58 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/could-not-reach-driver-of-cluster/m-p/62679#M32041</guid>
      <dc:creator>Yulei</dc:creator>
      <dc:date>2024-03-05T23:06:58Z</dc:date>
    </item>
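Yulei's fix (doubling driver memory by moving to a bigger instance) can be expressed with the Clusters API field `driver_node_type_id`, which lets the driver run on a different instance type than the workers. A small sketch; the `i3en.xlarge` choice is an illustrative assumption, not taken from the thread:

```python
# Sketch: upsize only the driver of a cluster spec, leaving workers as-is.
# Instance types below are illustrative, not from the thread.
base = {
    "node_type_id": "i3en.large",         # workers, as in the post
    "driver_node_type_id": "i3en.large",  # driver defaults to the worker type
}

def upsize_driver(spec: dict, driver_type: str) -> dict:
    """Return a copy of the cluster spec with a bigger driver node type."""
    out = dict(spec)
    out["driver_node_type_id"] = driver_type
    return out

# i3en.xlarge carries roughly double the memory of i3en.large.
fixed = upsize_driver(base, "i3en.xlarge")
print(fixed["driver_node_type_id"])
```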
    <item>
      <title>Re: Could not reach driver of cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/could-not-reach-driver-of-cluster/m-p/76501#M35230</link>
      <description>&lt;P&gt;To expand on the same error "&lt;STRONG&gt;Could not reach driver of cluster XX&lt;/STRONG&gt;" but with a different cause:&lt;/P&gt;&lt;P&gt;the reason in my case (an ADF-triggered Databricks job that runs into this error) was a problem with the &lt;STRONG&gt;numpy&lt;/STRONG&gt; library version, and the solution is to downgrade the library on the cluster before the run, e.g. &lt;STRONG&gt;pip install "numpy&amp;lt;2"&lt;/STRONG&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 02 Jul 2024 12:13:30 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/could-not-reach-driver-of-cluster/m-p/76501#M35230</guid>
      <dc:creator>Kub4S</dc:creator>
      <dc:date>2024-07-02T12:13:30Z</dc:date>
    </item>
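Kub4S's workaround pins numpy below major version 2 before the job runs. A small sketch of the corresponding version guard; the helper names are mine, and on a real cluster you would pass `numpy.__version__` and, if the guard fires, run the downgrade (e.g. `pip install "numpy<2"`) before starting the job:

```python
# Sketch: decide whether an installed numpy would need the downgrade
# described in the reply. Generic helpers; on a cluster you would pass
# numpy.__version__ instead of the literal strings below.
def major_version(version: str) -> int:
    """Major component of a dotted version string, e.g. '1.26.4' -> 1."""
    return int(version.split(".")[0])

def needs_numpy_downgrade(version: str) -> bool:
    # The reply pins below 2, i.e. any 2.x install is the problem case.
    return major_version(version) >= 2

print(needs_numpy_downgrade("2.0.1"))   # True: a 2.x install needs the pin
print(needs_numpy_downgrade("1.26.4"))  # False: a 1.x install is fine
```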
    <item>
      <title>Re: Could not reach driver of cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/could-not-reach-driver-of-cluster/m-p/81554#M36341</link>
      <description>&lt;P&gt;Kub4S,&lt;/P&gt;&lt;P&gt;How do you &lt;SPAN&gt;downgrade the library on the cluster before the run?&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 01 Aug 2024 19:27:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/could-not-reach-driver-of-cluster/m-p/81554#M36341</guid>
      <dc:creator>copper-carrot</dc:creator>
      <dc:date>2024-08-01T19:27:29Z</dc:date>
    </item>
    <item>
      <title>Re: Could not reach driver of cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/could-not-reach-driver-of-cluster/m-p/130614#M48853</link>
      <description>&lt;P&gt;It seems like a temporary connectivity or cluster initialization glitch. So if anyone else runs into this, try re-running the job before diving into deeper troubleshooting - it might just work!&lt;/P&gt;&lt;P&gt;Hope this helps someone save time.&lt;/P&gt;</description>
      <pubDate>Wed, 03 Sep 2025 07:40:38 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/could-not-reach-driver-of-cluster/m-p/130614#M48853</guid>
      <dc:creator>osingh</dc:creator>
      <dc:date>2025-09-03T07:40:38Z</dc:date>
    </item>
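osingh's advice to simply re-run the job matches what Yulei reported about production recovering on the next retry. A generic sketch of bounded retries around a job launch; the function names and the toy failure below are illustrative, not from the thread:

```python
# Sketch: retry a flaky job launch a bounded number of times before
# escalating to deeper troubleshooting. `launch` is any callable that
# raises on failure and returns normally on success.
def run_with_retries(launch, max_attempts: int = 3):
    last_err = None
    for _attempt in range(max_attempts):
        try:
            return launch()
        except RuntimeError as err:  # e.g. "Could not reach driver of cluster"
            last_err = err
    raise last_err

# Toy usage: fails once, then succeeds on the retry.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 2:
        raise RuntimeError("Could not reach driver of cluster <some_id>")
    return "started"

print(run_with_retries(flaky))  # prints "started" on the second attempt
```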
  </channel>
</rss>

