Data Engineering

Azure Databricks Job Run Failed with Error - Could not reach driver of cluster

sandeepsuresh16
Visitor

Hello Community,

I am facing an intermittent issue while running a Databricks job. The job fails with the following error message:

Run failed with error message:
Could not reach driver of cluster <cluster-id>.

Here are some additional details:

  • Cluster Type: Job cluster
  • Cluster Size: Standard_F8
  • Job Setup: This job runs a standard ETL notebook

Behavior:

  • The error is not consistent; when retried, the job sometimes succeeds.
  • No recent changes were made to the job code or cluster configuration.
  • There were no schema changes in the tables involved.

Questions for the community:

  • What are the possible reasons for getting a "Could not reach driver of cluster" error?
  • Is this usually caused by transient network issues, cluster instability, or driver overload?
  • Are there any recommended best practices or cluster configurations to prevent such driver reachability failures?
  • Should I look for specific logs or metrics in the driver logs to narrow down the root cause?

Any guidance or troubleshooting tips would be highly appreciated.

Note: I attached the cluster log for reference

1 REPLY

K_Anudeep
Contributor

Hello @sandeepsuresh16 ,
Below are the answers to your questions:

  • What are the possible reasons for getting a "Could not reach driver of cluster" error?

The error "Could not reach driver of cluster <cluster-id>" can occur due to several different reasons. Use the following troubleshooting steps to verify that the cause of your error matches any of the below: 

  1. Check whether the job runs multiple tasks concurrently, which increases the load on the driver.
  2. At the time of failure, check whether the driver's CPU and memory utilization are unusually high (approaching or at 100%). I checked the attached PDF and see continuous Full GC entries, which indicate the driver is under memory pressure trying to clean up objects; this may well be the cause of your issue.
  3. Look for the following error trace in the driver logs. It indicates a REPL (Read-Eval-Print Loop) startup failure due to a timeout, often caused by too many REPLs starting simultaneously:
    Failed to start repl ReplId-<id> com.databricks.backend.daemon.driver.PythonDriverLocal$PythonException: Unable to start python kernel for ReplId-<id>, kernel did not start within 80 seconds.
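
If cluster log delivery is enabled, you can scan the delivered driver logs for both of these signatures programmatically. Here is a minimal Python sketch, assuming logs are delivered to dbfs:/cluster-logs/<cluster-id>/driver/ (the path and cluster id are placeholders; adjust them to your log delivery configuration) and that it runs in a Databricks notebook where spark is predefined:

    # Minimal sketch: count GC-pressure and REPL-timeout signatures in
    # delivered driver logs. Assumes cluster log delivery to DBFS is enabled;
    # the cluster id and log path below are placeholders.
    cluster_id = "0123-456789-abcdefgh"  # placeholder cluster id
    log_dir = f"dbfs:/cluster-logs/{cluster_id}/driver/"

    # Each line of every driver log file becomes one row with a `value` column.
    logs = spark.read.text(log_dir + "*")

    for signature in ["Full GC", "Failed to start repl"]:
        hits = logs.filter(logs.value.contains(signature)).count()
        print(f"{signature!r}: {hits} matching log lines")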
  • Is this usually caused by transient network issues, cluster instability, or driver overload? Answered above.

  • Are there any recommended best practices or cluster configurations to prevent such driver reachability failures?
  1. Move from the compute-optimized F-series driver to a memory-optimized node (e.g., E-series) or a general-purpose node (D-series), or at least a larger F node. Increase spark.driver.memory by choosing a bigger node type, not just by setting the conf. Reduce collect()/toPandas() calls and any driver-side loops or UDF work; see the sketch after this list for driver-friendly alternatives.

  2. If you launch many notebooks/tasks at once, raise the REPL launch timeout (Jobs → Compute → Spark config):

    spark.databricks.driver.ipykernel.launchTimeoutSeconds 300
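
As a concrete illustration of item 1, here is a minimal sketch of driver-friendly patterns that avoid materializing data on the driver (the table and column names are hypothetical):

    # Minimal sketch: keep the work on the executors instead of the driver.
    df = spark.table("etl.source_events")  # hypothetical source table

    # Avoid: df.collect() / df.toPandas() pull the whole dataset into driver
    # memory and are a common cause of driver GC pressure.

    # Prefer: aggregate on the executors and write the result out.
    (df.groupBy("event_type")
       .count()
       .write.mode("overwrite")
       .saveAsTable("etl.event_type_counts"))  # hypothetical target table

    # If you genuinely need a local sample for inspection, cap its size first.
    preview = df.limit(1000).toPandas()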

  • Should I look for specific logs or metrics in the driver logs to narrow down the root cause?
    • Cluster events (same timestamp as the failed run): look for DRIVER_NOT_RESPONDING / "Driver is up but not responsive, likely due to GC." You can also pull these events via the REST API, as sketched below.
    • Driver memory metrics: open the cluster's Metrics tab to check driver utilization.
    • GC logs: check whether the driver is under memory pressure by looking for frequent Full GC pauses.
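
Here is a minimal sketch of pulling those cluster events via the Clusters REST API; the workspace URL, token, and cluster id are placeholders, and you should verify the endpoint against your workspace's API version:

    import requests

    # Minimal sketch: list DRIVER_NOT_RESPONDING events for a cluster.
    host = "https://<your-workspace>.azuredatabricks.net"   # placeholder
    token = "<personal-access-token>"                       # placeholder
    cluster_id = "0123-456789-abcdefgh"                     # placeholder

    resp = requests.post(
        f"{host}/api/2.0/clusters/events",
        headers={"Authorization": f"Bearer {token}"},
        json={"cluster_id": cluster_id,
              "event_types": ["DRIVER_NOT_RESPONDING"],
              "limit": 50},
    )
    resp.raise_for_status()
    for event in resp.json().get("events", []):
        print(event["timestamp"], event.get("details", {}))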

Please let me know if you have any further questions.

Thanks
