DLT | Communication lost with driver | Cluster was not reachable for 120 seconds

mkwparth
New Contributor III

Hey Community, 

I'm facing this error: "com.databricks.pipelines.common.errors.deployment.DeploymentException: Communication lost with driver. Cluster 1030-205818-yu28ft9s was not reachable for 120 seconds"


This issue occurred in production, but after re-running the job it worked fine. I'm unable to figure out why it happens intermittently; it's quite a strange and inconsistent error. Has anyone else experienced something similar, or does anyone know what might be causing it?

1 ACCEPTED SOLUTION

nayan_wylde
Esteemed Contributor

This is actually a known intermittent issue in Databricks, particularly with streaming or Delta Live Tables (DLT) pipelines.

This isn't a logical failure in your code; it's an infrastructure-level timeout between the Databricks control plane and the driver node of your cluster. Essentially, Databricks lost communication with the driver for 2 minutes (120 seconds). After that period, it assumes the driver is dead and throws this exception. When you rerun, it works because the cluster re-initializes and network connections reset.

Here are a few troubleshooting steps (example sketches follow the list):

  1. Check Driver Logs
    • Go to Compute → Cluster → Spark UI → Driver logs
    • Search for:
      • heartbeat timeout
      • GC overhead limit exceeded
      • OutOfMemoryError
      • communication lost
  2. Check Databricks Event Logs
    • Query the pipeline's event log, or the system tables in Unity Catalog (if logging is enabled).
  3. Monitor Cluster Metrics
    • Enable cluster metrics via Databricks REST API or Azure Monitor integration.
    • Look for CPU/memory spikes around failure time.
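
For step 1, here's a minimal sketch (run in a Databricks notebook, where spark is predefined) that scans delivered driver logs for those signatures. It assumes cluster log delivery is enabled; the dbfs:/cluster-logs destination path is an assumption, so adjust it to your setup.

```python
# Scan delivered driver logs for known failure signatures.
# Assumes cluster log delivery is configured to dbfs:/cluster-logs
# (the destination path is an assumption; adjust to your workspace).
log_path = "dbfs:/cluster-logs/1030-205818-yu28ft9s/driver/*"

signatures = [
    "heartbeat timeout",
    "GC overhead limit exceeded",
    "OutOfMemoryError",
    "communication lost",
]

logs = spark.read.text(log_path)  # reads plain and .gz log files
for sig in signatures:
    count = logs.filter(logs.value.contains(sig)).count()
    print(f"{sig!r}: {count} occurrence(s)")
```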
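
For step 2, DLT exposes a queryable event log; here's a sketch using the event_log() table-valued function (the pipeline ID is a placeholder):

```python
# Pull WARN/ERROR events from the DLT event log around the failure window.
# '<your-pipeline-id>' is a placeholder; substitute your pipeline's ID.
events = spark.sql("""
    SELECT timestamp, event_type, level, message
    FROM event_log('<your-pipeline-id>')
    WHERE level IN ('WARN', 'ERROR')
    ORDER BY timestamp DESC
""")
events.show(truncate=False)
```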
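
For step 3, cluster lifecycle events (resizes, terminations, driver health) can also be pulled with the Clusters API. A sketch against the /api/2.0/clusters/events endpoint; host and token are placeholders:

```python
# List recent events for the affected cluster via the REST API.
import requests

host = "https://<your-workspace>.azuredatabricks.net"  # placeholder
token = "<personal-access-token>"                      # placeholder

resp = requests.post(
    f"{host}/api/2.0/clusters/events",
    headers={"Authorization": f"Bearer {token}"},
    json={"cluster_id": "1030-205818-yu28ft9s", "limit": 50},
)
resp.raise_for_status()
for event in resp.json().get("events", []):
    print(event["timestamp"], event["type"])
```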

Here are some possible fixes you can implement (illustrative configuration sketches follow the table).

Root Cause → Mitigation

  • Driver overload → Use a larger driver; tune memory configs
  • Transient network loss → Enable retry logic in the job or pipeline
  • Auto-termination wake-up → Keep the cluster warm
  • Long DLT deployments → Separate deployment from execution
  • Azure transient failures → Retry, or contact Databricks support if frequent
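
For the retry mitigation, here's a sketch that adds task-level retries through the Jobs 2.1 API, so a transient driver loss is retried instead of failing the run. The job ID and task key are placeholders, and note that jobs/update replaces the fields you send:

```python
# Add automatic retries to the job task that triggers the pipeline.
import requests

host = "https://<your-workspace>.azuredatabricks.net"  # placeholder
token = "<personal-access-token>"                      # placeholder

resp = requests.post(
    f"{host}/api/2.1/jobs/update",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "job_id": 123,  # placeholder job ID
        "new_settings": {
            "tasks": [{
                "task_key": "run_dlt_pipeline",       # placeholder task key
                "max_retries": 2,                     # retry up to twice
                "min_retry_interval_millis": 300000,  # wait 5 min between tries
            }]
        },
    },
)
resp.raise_for_status()
```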
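
For the driver-overload and keep-warm mitigations, the relevant knobs are the driver node type and the auto-termination window. A sketch of the settings (node types and values are examples only; for a DLT pipeline these belong under the pipeline's own cluster settings rather than a standalone cluster):

```python
# Illustrative settings: a larger driver node and a longer
# auto-termination window to keep the cluster warm.
cluster_settings = {
    "cluster_id": "1030-205818-yu28ft9s",
    "spark_version": "15.4.x-scala2.12",        # example runtime
    "node_type_id": "Standard_D8ds_v5",         # example worker size
    "driver_node_type_id": "Standard_E8ds_v5",  # larger driver (example)
    "autotermination_minutes": 120,             # keep the cluster warm longer
    "num_workers": 4,
}
# Apply with POST {host}/api/2.0/clusters/edit, using the same auth
# headers as the sketches above.
```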


2 REPLIES

AbhaySingh
Databricks Employee

