Data Engineering

Azure Databricks Job Run Failed with Error - Could not reach driver of cluster

sandeepsuresh16
Visitor

Hello Community,

I am facing an intermittent issue while running a Databricks job. The job fails with the following error message:

Run failed with error message:
Could not reach driver of cluster <cluster-id>.

Here are some additional details:

  • Cluster Type: Job cluster
  • Cluster Size: Standard_F8
  • Job Setup: This job runs a standard ETL notebook

Behavior:

  • The error is not consistent; when retried, the job sometimes succeeds.
  • No recent changes were made to the job code or cluster configuration.
  • There were no schema changes in the tables involved.

Questions for the community:

  • What are the possible reasons for getting a "Could not reach driver of cluster" error?
  • Is this usually caused by transient network issues, cluster instability, or driver overload?
  • Are there any recommended best practices or cluster configurations to prevent such driver reachability failures?
  • Should I look for specific logs or metrics in the driver logs to narrow down the root cause?

Any guidance or troubleshooting tips would be highly appreciated.

Note: I attached the cluster log for reference

1 REPLY

K_Anudeep
Contributor

Hello @sandeepsuresh16 ,
Below are the answers to your questions:

  • What are the possible reasons for getting a "Could not reach driver of cluster" error?

The error "Could not reach driver of cluster <cluster-id>" can occur due to several different reasons. Use the following troubleshooting steps to verify that the cause of your error matches any of the below: 

  1. Check whether the job runs multiple tasks concurrently, which increases the load on the driver.
  2. At the time of failure, check whether the driver's CPU and memory utilization are unusually high (approaching or at 100%). I checked the attached PDF and see continuous Full GC entries, which indicate the driver is under memory pressure trying to clean up objects; this may well be the cause of your issue.
  3. Look for the following error trace in the driver logs. It indicates a REPL (Read-Eval-Print Loop) startup failure due to a timeout, often caused by too many REPLs starting simultaneously:
    Failed to start repl ReplId-<id> com.databricks.backend.daemon.driver.PythonDriverLocal$PythonException: Unable to start python kernel for ReplId-<id>, kernel did not start within 80 seconds.
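
If cluster log delivery is enabled, you can scan the delivered driver logs for both of these signatures programmatically. Here is a minimal Python sketch, assuming logs are delivered to dbfs:/cluster-logs/<cluster-id>/driver/ (the path and cluster id are placeholders; adjust them to your log delivery configuration) and that it runs in a Databricks notebook where spark is predefined:

    # Minimal sketch: count GC-pressure and REPL-timeout signatures in
    # delivered driver logs. Assumes cluster log delivery to DBFS is enabled;
    # the cluster id and log path below are placeholders.
    cluster_id = "0123-456789-abcdefgh"  # placeholder cluster id
    log_dir = f"dbfs:/cluster-logs/{cluster_id}/driver/"

    # Each line of every driver log file becomes one row with a `value` column.
    logs = spark.read.text(log_dir + "*")

    for signature in ["Full GC", "Failed to start repl"]:
        hits = logs.filter(logs.value.contains(signature)).count()
        print(f"{signature!r}: {hits} matching log lines")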
  • Is this usually caused by transient network issues, cluster instability, or driver overload? Answered above.

  • Are there any recommended best practices or cluster configurations to prevent such driver reachability failures?
  1. Move from the compute-optimized F-series driver to a memory-optimized node (e.g., E-series) or a general-purpose node (D-series), or at least a larger F node. Increase spark.driver.memory by choosing a bigger node type, not just by setting the conf. Reduce collect()/toPandas() calls and any driver-side loops or UDF work; see the sketch after this list for driver-friendly alternatives.

  2. If you launch many notebooks/tasks at once, raise the REPL launch timeout (Jobs → Compute → Spark config):

    spark.databricks.driver.ipykernel.launchTimeoutSeconds 300
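
As a concrete illustration of item 1, here is a minimal sketch of driver-friendly patterns that avoid materializing data on the driver (the table and column names are hypothetical):

    # Minimal sketch: keep the work on the executors instead of the driver.
    df = spark.table("etl.source_events")  # hypothetical source table

    # Avoid: df.collect() / df.toPandas() pull the whole dataset into driver
    # memory and are a common cause of driver GC pressure.

    # Prefer: aggregate on the executors and write the result out.
    (df.groupBy("event_type")
       .count()
       .write.mode("overwrite")
       .saveAsTable("etl.event_type_counts"))  # hypothetical target table

    # If you genuinely need a local sample for inspection, cap its size first.
    preview = df.limit(1000).toPandas()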

  • Should I look for specific logs or metrics in the driver logs to narrow down the root cause?
    • Cluster events (same timestamp as the failed run): look for DRIVER_NOT_RESPONDING / "Driver is up but not responsive, likely due to GC." You can also pull these events via the REST API, as sketched below.
    • Driver memory metrics: open the cluster's Metrics tab to check driver utilization.
    • GC logs: check whether the driver is under memory pressure by looking for frequent Full GC pauses.
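
Here is a minimal sketch of pulling those cluster events via the Clusters REST API; the workspace URL, token, and cluster id are placeholders, and you should verify the endpoint against your workspace's API version:

    import requests

    # Minimal sketch: list DRIVER_NOT_RESPONDING events for a cluster.
    host = "https://<your-workspace>.azuredatabricks.net"   # placeholder
    token = "<personal-access-token>"                       # placeholder
    cluster_id = "0123-456789-abcdefgh"                     # placeholder

    resp = requests.post(
        f"{host}/api/2.0/clusters/events",
        headers={"Authorization": f"Bearer {token}"},
        json={"cluster_id": cluster_id,
              "event_types": ["DRIVER_NOT_RESPONDING"],
              "limit": 50},
    )
    resp.raise_for_status()
    for event in resp.json().get("events", []):
        print(event["timestamp"], event.get("details", {}))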

Please let me know if you have any further questions.

Thanks
