10-30-2025 11:48 PM
Hey Community,
I'm facing this error: "com.databricks.pipelines.common.errors.deployment.DeploymentException: Communication lost with driver. Cluster 1030-205818-yu28ft9s was not reachable for 120 seconds"
This issue occurred in production, but the job worked fine after I re-ran it. I can't figure out why it happens intermittently; the error is strange and inconsistent. Has anyone else experienced something similar, or does anyone know what might be causing it?
Labels: Spark
11-03-2025 07:49 AM
Can you please try looking at detailed logs?
https://docs.microsoft.com/en-us/azure/databricks/clusters/configure#cluster-log-delivery
11-03-2025 09:20 AM
This is actually a known intermittent issue in Databricks, particularly with streaming or Delta Live Tables (DLT) pipelines.
This isn’t a logical failure in your code; it’s an infrastructure-level timeout between the Databricks control plane and the driver node of your cluster. Essentially, Databricks lost communication with the driver for 2 minutes (120 seconds). After that period, it assumes the driver is dead and throws this exception. Then, when you rerun, it works, because the cluster re-initializes and network connections reset.
Here are a few troubleshooting steps:
- Check Driver Logs
  - Go to Compute → your cluster → Driver logs
  - Search for:
    - heartbeat timeout
    - GC overhead limit exceeded
    - OutOfMemoryError
    - communication lost
- Check Databricks Event Logs
  - Query the system.logs or eventLogs table in Unity Catalog (if logging is enabled).
- Monitor Cluster Metrics
  - Enable cluster metrics via the Databricks REST API or the Azure Monitor integration.
  - Look for CPU/memory spikes around the failure time.
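If the driver log is large, the search step above is easier to script than to do by eye. Here's a minimal sketch, assuming you've downloaded the driver log (or pulled it from your log delivery location) to a local text file. The function name `scan_driver_log` is just something I made up for illustration; it isn't part of any Databricks API.

```python
import re
from collections import Counter
from typing import Iterable

# Search terms from the troubleshooting list above.
PATTERNS = [
    "heartbeat timeout",
    "GC overhead limit exceeded",
    "OutOfMemoryError",
    "communication lost",
]

# One case-insensitive regex that matches any of the terms literally.
_REGEX = re.compile("|".join(re.escape(p) for p in PATTERNS), re.IGNORECASE)


def scan_driver_log(lines: Iterable[str]) -> Counter:
    """Count occurrences of each search term across the given log lines."""
    canonical = {p.lower(): p for p in PATTERNS}
    hits: Counter = Counter()
    for line in lines:
        for match in _REGEX.finditer(line):
            # Normalize the matched text back to the spelling in PATTERNS.
            hits[canonical[match.group(0).lower()]] += 1
    return hits
```

You can pass it an open file handle, for example `scan_driver_log(open("driver_log.txt"))`, where `driver_log.txt` is a placeholder for wherever you saved the log. A non-zero count for OutOfMemoryError or GC overhead points toward driver overload rather than a network blip.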
Here are some possible fixes you can implement.
| Root Cause | Mitigation |
| --- | --- |
| Driver overload | Use larger driver; tune memory configs |
| Transient network loss | Enable retry logic in job or pipeline |
| Auto-termination wake-up | Keep cluster warm |
| Long DLT deployments | Separate deployment from execution |
| Azure transient failures | Retry, or contact Databricks support if frequent |
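For the retry option, the simplest route is usually the retry policy in the job's task settings. If you trigger the pipeline from your own orchestration code instead, a generic retry wrapper with exponential backoff could look like the sketch below. This is only an illustration: `run_with_retry` and its parameters are names I picked, and `task` stands in for whatever callable starts your job or pipeline.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retry(
    task: Callable[[], T],
    max_attempts: int = 3,
    base_delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run task(), retrying on any exception with exponential backoff.

    The delay doubles after each failed attempt: base, 2*base, 4*base, ...
    The last failure is re-raised so the caller still sees the real error.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception as exc:
            if attempt == max_attempts:
                raise
            delay = base_delay_s * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            sleep(delay)
    raise AssertionError("unreachable")
```

In practice you would probably narrow the `except` clause to the specific transient error you see, so that genuine code failures aren't retried. The `sleep` parameter is only there so the wait can be swapped out in tests.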