DLT | Communication lost with driver | Cluster was not reachable for 120 seconds

mkwparth — Fri, 31 Oct 2025 06:48:46 GMT

Hey Community,

I'm facing this error: "com.databricks.pipelines.common.errors.deployment.DeploymentException: Communication lost with driver. Cluster 1030-205818-yu28ft9s was not reachable for 120 seconds"

This issue occurred in production, but after re-running the job, it worked fine. I'm unable to figure out why it happens intermittently - it’s quite a strange and inconsistent error. Has anyone else experienced something similar or knows what might be causing it?

Re: DLT | Communication lost with driver | Cluster was not reachable for 120 seconds

AbhaySingh — Mon, 03 Nov 2025 15:49:35 GMT

Can you please try looking at detailed logs?

https://docs.microsoft.com/en-us/azure/databricks/clusters/configure#cluster-log-delivery

Re: DLT | Communication lost with driver | Cluster was not reachable for 120 seconds

nayan_wylde — Mon, 03 Nov 2025 17:20:05 GMT

This is actually a known intermittent issue in Databricks, particularly with streaming or Delta Live Tables (DLT) pipelines.

This isn’t a logical failure in your code; it’s an infrastructure-level timeout between the Databricks control plane and the driver node of your cluster. Essentially, Databricks lost communication with the driver for 2 minutes (120 seconds). After that period, it assumes the driver is dead and throws this exception. When you rerun, it works because the cluster re-initializes and network connections reset.

Here are a few troubleshooting steps:

Check Driver Logs

Go to Compute → Cluster → Driver logs (the tab next to Spark UI)
Search for:

heartbeat timeout
GC overhead limit exceeded
OutOfMemoryError
communication lost
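
If you have downloaded the driver log (or have it delivered to storage), searching it can be scripted. Below is a minimal sketch, assuming the log is a plain-text file you can read locally. The function names `scan_log` and `scan_file` are hypothetical helpers for illustration and are not part of any Databricks API. The markers are the four strings listed above.

```python
import re
from collections import Counter

# Markers from the list above; matching is case-insensitive.
MARKERS = [
    "heartbeat timeout",
    "GC overhead limit exceeded",
    "OutOfMemoryError",
    "communication lost",
]

_PATTERN = re.compile("|".join(re.escape(m) for m in MARKERS), re.IGNORECASE)
_CANONICAL = {m.lower(): m for m in MARKERS}


def scan_log(lines):
    """Scan an iterable of log lines.

    Returns (counts, hits): counts is a Counter keyed by marker, and hits is
    a list of (line_number, marker, line_text) tuples in file order.
    """
    counts = Counter()
    hits = []
    for number, line in enumerate(lines, start=1):
        for match in _PATTERN.finditer(line):
            marker = _CANONICAL[match.group(0).lower()]
            counts[marker] += 1
            hits.append((number, marker, line.rstrip()))
    return counts, hits


def scan_file(path):
    """Convenience wrapper: scan a log file on disk (path is up to you)."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return scan_log(handle)
```

Calling `scan_file` on the downloaded log returns the per-marker counts plus the matching lines with their line numbers, which makes it easier to line up a memory or heartbeat problem with the failure time.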

Check Databricks Event Logs

Query the system.logs or eventLogs table in Unity Catalog (if logging is enabled).

Monitor Cluster Metrics

Enable cluster metrics via Databricks REST API or Azure Monitor integration.
Look for CPU/memory spikes around failure time.

Here are some possible fixes you can implement.

Root Cause                  Mitigation
Driver overload             Use larger driver; tune memory configs
Transient network loss      Enable retry logic in job or pipeline
Auto-termination wake-up    Keep cluster warm
Long DLT deployments        Separate deployment from execution
Azure transient failures    Retry, or contact Databricks support if frequent
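
For the "retry logic" row, here is a rough sketch of a retry wrapper with exponential backoff, in case you trigger the pipeline from your own orchestration code. Everything in it is an assumption for illustration: `run_with_retries` and `looks_transient` are hypothetical names, `action` stands for whatever callable you use to start the pipeline update and raise on failure, and the 30-second base delay is arbitrary. If you run the pipeline as a Databricks Jobs task, the task-level retry settings may already cover this without any code.

```python
import time


def looks_transient(exc):
    """Assumption: treat the error from this thread as retryable."""
    return "Communication lost with driver" in str(exc)


def run_with_retries(action, max_attempts=3, base_delay_s=30.0,
                     is_transient=looks_transient, sleep=time.sleep):
    """Call action(); retry transient failures with exponential backoff.

    Non-transient errors, and the final failed attempt, are re-raised as-is.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except Exception as exc:
            if attempt == max_attempts or not is_transient(exc):
                raise
            # Wait 30s, 60s, 120s, ... (with the default base delay).
            sleep(base_delay_s * 2 ** (attempt - 1))
```

Only errors that match the "Communication lost with driver" message are retried, so genuine code or configuration failures still surface immediately instead of being masked by retries.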