
Could not reach driver of cluster

Yulei
New Contributor III

 

Hi,

Recently, I have been seeing the issue Could not reach driver of cluster <some_id> with my Structured Streaming job while migrating to Unity Catalog, and I found this when checking the traceback:

Traceback (most recent call last):
File "/databricks/python_shell/scripts/db_ipykernel_launcher.py", line 142, in <module>
main()
File "/databricks/python_shell/scripts/db_ipykernel_launcher.py", line 129, in main
registerPythonPathHook(entry_point, sc)
File "/databricks/python_shell/dbruntime/pythonPathHook.py", line 214, in registerPythonPathHook
entry_point.setPythonPathHook(pathHook)
File "/databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
return_value = get_return_value(
File "/databricks/spark/python/pyspark/errors/exceptions/captured.py", line 224, in deco
return f(*a, **kw)
File "/databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling t.setPythonPathHook.
: java.lang.IllegalStateException: Promise already completed.
at scala.concurrent.Promise.complete(Promise.scala:53)
at scala.concurrent.Promise.complete$(Promise.scala:52)
at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:187)
at scala.concurrent.Promise.success(Promise.scala:86)
at scala.concurrent.Promise.success$(Promise.scala:86)
at scala.concurrent.impl.Promise$DefaultPromise.success(Promise.scala:187)
at com.databricks.backend.daemon.driver.JupyterDriverLocal$JupyterEntryPoint.setPythonPathHook(JupyterDriverLocal.scala:292)
at sun.reflect.GeneratedMethodAccessor164.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:397)
at py4j.Gateway.invoke(Gateway.java:306)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:195)
at py4j.ClientServerConnection.run(ClientServerConnection.java:115)
at java.lang.Thread.run(Thread.java:750)

 

There are ~55 tasks in the same job, split across 3 different clusters.

Changes we made before seeing this issue:
Cluster:

From unrestricted, no isolation -> unrestricted, single user (to enable Unity Catalog); it still failed when using single-user job compute
From Databricks Runtime 13.0 -> 13.3; it still failed with the same error on 14.0 and 14.3

Each cluster uses i3en.large with autoscaling from 1-3 workers, and here is the Spark config (a quick way to confirm these values are applied is sketched after the list):
spark.executor.heartbeatInterval 10000000

spark.driver.maxResultSize 30g

spark.network.timeout 10000000

spark.sql.parquet.enableVectorizedReader false

spark.databricks.delta.preview.enabled true

maxRowsInMemory 1000
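
For reference, a minimal way to confirm these values are actually applied on the job cluster is to print them from the driver at the start of the streaming task. This is only a sketch; it assumes spark is the SparkSession that Databricks provides in the notebook/task:

# Sketch: print the effective cluster-level Spark conf values from the driver.
conf = spark.sparkContext.getConf()
for key in [
    "spark.executor.heartbeatInterval",
    "spark.network.timeout",
    "spark.driver.maxResultSize",
    "spark.sql.parquet.enableVectorizedReader",
]:
    print(key, "=", conf.get(key, "<not set>"))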

However, we are not seeing the issue when running a single streaming task created in a separate job as a test, and there is no issue when running interactively on an all-purpose cluster from a notebook.


Fortunately, production has a recovery mechanism and recovers on the next retry, but we would still like to know what can be done so the streaming job can start without hitting the Could not reach driver of cluster issue.

Let me know if you need more information to understand what happened.

1 ACCEPTED SOLUTION


Latonya86Dodson
New Contributor III


Hello,

It seems that the problem is related to the setPythonPathHook method, which is used to register the Python path hook with the driver. Internally this call completes a Promise, and a Promise can only be completed once.

However, in your case it appears the Promise had already been completed by the time this call ran, resulting in the IllegalStateException.

There are a few possible causes for this issue, such as:

A network connectivity issue between the driver and the workers, which could prevent the Promise object from being properly communicated or updated. This could also explain why you are seeing the "Could not reach driver of cluster" error.

You can check the network settings of your clusters and make sure they are not blocking any ports or protocols required by Databricks. You can also try to use a different network or region if possible.

A concurrency issue, where multiple threads or processes are trying to access or modify the same Promise object. This could happen if you are running multiple jobs or notebooks on the same cluster, or if you are using any parallelization or multiprocessing libraries in your code.

You can try to isolate your job or notebook from other workloads, or use synchronization mechanisms to avoid race conditions.

A memory issue, where the driver or the workers are running out of memory and causing the Promise object to be corrupted or lost. This could happen if you are processing large amounts of data or using memory-intensive libraries or operations. You can try to increase the memory allocation for your clusters, or optimize your code to reduce memory usage.
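
If driver memory pressure turns out to be the cause, one option is to give the job cluster a larger driver node while keeping the workers as they are. Below is a minimal sketch of the relevant part of a Jobs/Clusters API new_cluster block; the field names follow the Databricks Clusters API, but the exact instance type and runtime string are assumptions to adapt to your workspace:

# Sketch only -- adapt instance types, runtime version and autoscale range to your job.
new_cluster = {
    "spark_version": "13.3.x-scala2.12",        # assumed LTS runtime string
    "node_type_id": "i3en.large",               # workers as described in the question
    "driver_node_type_id": "i3en.xlarge",       # assumption: a larger-memory driver node
    "autoscale": {"min_workers": 1, "max_workers": 3},
    "data_security_mode": "SINGLE_USER",        # single-user mode for Unity Catalog
    "spark_conf": {
        "spark.sql.parquet.enableVectorizedReader": "false",
        "spark.databricks.delta.preview.enabled": "true",
    },
}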

To troubleshoot this issue further, you can also look at the logs of your clusters and your jobs, and see if there are any other errors or warnings that could indicate the root cause.
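
For example, if cluster log delivery is configured with a DBFS destination, the driver logs for the failing cluster can be listed from a notebook. This is a sketch under that assumption; the destination path is whatever you configured, and the cluster id placeholder comes from the error message:

# Assumption: log delivery is enabled and driver logs land under <destination>/<cluster-id>/driver/
log_destination = "dbfs:/cluster-logs"   # replace with your configured destination
cluster_id = "<some_id>"                 # the cluster id from the "Could not reach driver" error
for f in dbutils.fs.ls(f"{log_destination}/{cluster_id}/driver/"):
    print(f.path, f.size)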

I hope this helps you fix the issue.

Best regards, 
Latonya86Dodson

 


4 REPLIES 4

Latonya86Dodson
New Contributor III

Hi, 

Was this information helpful to you? If it does not resolve the issue, let me know and I will help further.

 

 

 

Yulei
New Contributor III

Hi, thanks for the reply. Apologies, I am going through each of the suggestions to understand the fix, and also trying to understand why this issue did not show up before I implemented the change to use a single-user cluster. I will provide more updates once I have gone through each of them.

Yulei
New Contributor III

@Latonya86Dodson, thank you for the reply. I ran a test, and it seems that doubling the memory of the driver (by changing to an instance type with more memory) works around this issue. However, I do question why this happens after I switched the job to personal compute. Why was I not seeing this when using the unrestricted, no-isolation cluster?
