DLT | Communication lost with driver | Cluster was not reachable for 120 seconds

mkwparth
Databricks Partner

Hey Community, 

I'm facing this error: "com.databricks.pipelines.common.errors.deployment.DeploymentException: Communication lost with driver. Cluster 1030-205818-yu28ft9s was not reachable for 120 seconds"


This issue occurred in production, but after re-running the job, it worked fine. I can't figure out why it happens intermittently; it’s a strange and inconsistent error. Has anyone else experienced something similar, or does anyone know what might be causing it?

AbhaySingh
Databricks Employee

nayan_wylde
Esteemed Contributor II

This is actually a known intermittent issue in Databricks, particularly with streaming or Delta Live Tables (DLT) pipelines.

This isn’t a logical failure in your code; it’s an infrastructure-level timeout between the Databricks control plane and the driver node of your cluster. Essentially, Databricks lost communication with the driver for 2 minutes (120 seconds). After that period, it assumes the driver is dead and throws this exception. When you rerun, it works because the cluster re-initializes and the network connections reset.

 

Here are a few troubleshooting steps:

  1. Check Driver Logs
    • Go to Compute → Cluster → Spark UI → Driver logs
    • Search for:
      • heartbeat timeout
      • GC overhead limit exceeded
      • OutOfMemoryError
      • communication lost
  2. Check Databricks Event Logs
    • system.logs or eventLogs table in Unity Catalog (if logging enabled).
  3. Monitor Cluster Metrics
    • Enable cluster metrics via Databricks REST API or Azure Monitor integration.
    • Look for CPU/memory spikes around failure time.
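
If you download the driver log as a text file, the search in step 1 can be scripted. The following is only a minimal sketch: the function name `scan_driver_log` and the sample log lines are made up for illustration, and the exact wording of real driver log messages may differ, so treat the pattern list as a starting point.

```python
import re

# Search terms from step 1 above, matched case-insensitively.
PATTERNS = [
    "heartbeat timeout",
    "GC overhead limit exceeded",
    "OutOfMemoryError",
    "communication lost",
]

_REGEX = re.compile("|".join(re.escape(p) for p in PATTERNS), re.IGNORECASE)


def scan_driver_log(lines):
    """Return (line_number, matched_term, line) for each line containing a search term."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = _REGEX.search(line)
        if match:
            hits.append((number, match.group(0).lower(), line.rstrip()))
    return hits


if __name__ == "__main__":
    # Invented sample lines, not real Databricks output.
    sample = [
        "INFO pipeline update started",
        "ERROR java.lang.OutOfMemoryError: Java heap space",
        "WARN Heartbeat timeout while waiting for driver",
    ]
    for number, term, line in scan_driver_log(sample):
        print(f"line {number}: [{term}] {line}")
```

To scan a real file, pass an open file handle instead of the sample list, since iterating over a file yields its lines.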

Here are some possible fixes you can implement.

Root Cause | Mitigation
Driver overload | Use larger driver; tune memory configs
Transient network loss | Enable retry logic in job or pipeline
Auto-termination wake-up | Keep cluster warm
Long DLT deployments | Separate deployment from execution
Azure transient failures | Retry, or contact Databricks support if frequent
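
For the "enable retry logic" mitigation, here is a rough sketch of what a retry wrapper could look like if you trigger the pipeline from your own orchestration code. Everything in it is an assumption for illustration: `start_update` stands for whatever callable you use to start the pipeline and raise on failure, and `run_with_retry` and `is_transient` are hypothetical helper names, not Databricks APIs. If the pipeline runs as a Databricks job task, the job's built-in task retry settings may be the simpler option.

```python
import time

# Substring of the error message quoted in the question; other failures are not retried.
TRANSIENT_MARKER = "Communication lost with driver"


def is_transient(exc):
    """Treat only the driver-communication timeout as retryable."""
    return TRANSIENT_MARKER in str(exc)


def run_with_retry(start_update, max_attempts=3, base_delay_s=30.0, sleep=time.sleep):
    """Call start_update(); retry transient failures with exponential backoff.

    start_update is a hypothetical zero-argument callable that starts the
    pipeline and raises an exception if the run fails.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return start_update()
        except Exception as exc:
            # Give up on non-transient errors or once attempts are exhausted.
            if not is_transient(exc) or attempt == max_attempts:
                raise
            delay = base_delay_s * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            sleep(delay)
```

Retrying only on the specific message avoids masking genuine code or data errors. If the timeout shows up frequently even with retries, that points back to the driver sizing and support options in the table.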
