This is actually a known intermittent issue in Databricks, particularly with streaming or Delta Live Tables (DLT) pipelines.
This isn’t a logical failure in your code; it’s an infrastructure-level timeout between the Databricks control plane and the driver node of your cluster. Essentially, Databricks lost communication with the driver for 2 minutes (120 seconds). After that period, it assumes the driver is dead and throws this exception. When you rerun, it works because the cluster re-initializes and network connections reset.
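To make the mechanism concrete, here is a toy sketch of the rule described above. The function name and constant are hypothetical illustrations, not Databricks internals: the control plane tracks the driver's last heartbeat, and once the gap exceeds 120 seconds it declares the driver dead.

```python
from datetime import datetime, timedelta

# The 2-minute window described above (illustrative constant, not a real config key).
HEARTBEAT_TIMEOUT = timedelta(seconds=120)

def driver_presumed_dead(last_heartbeat: datetime, now: datetime) -> bool:
    """Toy model of the control-plane rule: no heartbeat for >120s
    means the driver is assumed dead and the exception is raised."""
    return now - last_heartbeat > HEARTBEAT_TIMEOUT
```

On a rerun the heartbeat clock effectively resets, which is why the same code succeeds the second time.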
Here are a few troubleshooting steps:
- Check Driver Logs
  - Go to Compute → Cluster → Spark UI → Driver logs
  - Search for:
    - `heartbeat timeout`
    - `GC overhead limit exceeded`
    - `OutOfMemoryError`
    - `communication lost`
- Check Databricks Event Logs
  - Query the `system.logs` or event log tables in Unity Catalog (if logging is enabled).
- Monitor Cluster Metrics
  - Enable cluster metrics via the Databricks REST API or Azure Monitor integration.
  - Look for CPU/memory spikes around the failure time.
Here are some possible fixes you can implement.
| Root Cause | Mitigation |
| --- | --- |
| Driver overload | Use larger driver; tune memory configs |
| Transient network loss | Enable retry logic in job or pipeline |
| Auto-termination wake-up | Keep cluster warm |
| Long DLT deployments | Separate deployment from execution |
| Azure transient failures | Retry, or contact Databricks support if frequent |
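For the retry mitigations in the table, a minimal sketch of retry logic with exponential backoff is shown below. The wrapper name is hypothetical; wrap whatever call triggers your job or pipeline in it so a single transient control-plane failure doesn't fail the whole run.

```python
import time

def with_retries(fn, max_attempts=3, base_delay=1.0, retryable=(RuntimeError,)):
    """Call fn(), retrying on `retryable` exceptions with exponential backoff.
    Delays are base_delay, 2*base_delay, 4*base_delay, ... between attempts;
    the last failure is re-raised once max_attempts is exhausted."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * (2 ** (attempt - 1)))
```

Databricks Jobs also support retries natively via the task-level retry settings in the job configuration, which is usually simpler than hand-rolled code; the sketch above is for callers outside the Jobs scheduler.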