DLT | Communication lost with driver | Cluster was not reachable for 120 seconds

mkwparth
New Contributor III

Hey Community, 

I'm facing this error: "com.databricks.pipelines.common.errors.deployment.DeploymentException: Communication lost with driver. Cluster 1030-205818-yu28ft9s was not reachable for 120 seconds"


This issue occurred in production, but after re-running the job it worked fine. I'm unable to figure out why it happens intermittently; it's quite a strange and inconsistent error. Has anyone else experienced something similar, or does anyone know what might be causing it?

1 ACCEPTED SOLUTION

nayan_wylde
Esteemed Contributor

This is actually a known intermittent issue in Databricks, particularly with streaming or Delta Live Tables (DLT) pipelines.

This isn't a logical failure in your code; it's an infrastructure-level timeout between the Databricks control plane and the driver node of your cluster. Essentially, Databricks lost communication with the driver for 2 minutes (120 seconds). After that period, it assumes the driver is dead and throws this exception. When you rerun, it works because the cluster re-initializes and network connections reset.

Here are a few troubleshooting steps (example sketches follow the list):

  1. Check Driver Logs
    • Go to Compute → Cluster → Spark UI → Driver logs
    • Search for:
      • heartbeat timeout
      • GC overhead limit exceeded
      • OutOfMemoryError
      • communication lost
  2. Check Databricks Event Logs
    • Query the pipeline's event log, or the system tables in Unity Catalog (if logging is enabled).
  3. Monitor Cluster Metrics
    • Enable cluster metrics via Databricks REST API or Azure Monitor integration.
    • Look for CPU/memory spikes around failure time.
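
For step 1, here's a minimal sketch (run in a Databricks notebook, where spark is predefined) that scans delivered driver logs for those signatures. It assumes cluster log delivery is enabled; the dbfs:/cluster-logs destination path is an assumption, so adjust it to your setup.

```python
# Scan delivered driver logs for known failure signatures.
# Assumes cluster log delivery is configured to dbfs:/cluster-logs
# (the destination path is an assumption; adjust to your workspace).
log_path = "dbfs:/cluster-logs/1030-205818-yu28ft9s/driver/*"

signatures = [
    "heartbeat timeout",
    "GC overhead limit exceeded",
    "OutOfMemoryError",
    "communication lost",
]

logs = spark.read.text(log_path)  # reads plain and .gz log files
for sig in signatures:
    count = logs.filter(logs.value.contains(sig)).count()
    print(f"{sig!r}: {count} occurrence(s)")
```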
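
For step 2, DLT exposes a queryable event log; here's a sketch using the event_log() table-valued function (the pipeline ID is a placeholder):

```python
# Pull WARN/ERROR events from the DLT event log around the failure window.
# '<your-pipeline-id>' is a placeholder; substitute your pipeline's ID.
events = spark.sql("""
    SELECT timestamp, event_type, level, message
    FROM event_log('<your-pipeline-id>')
    WHERE level IN ('WARN', 'ERROR')
    ORDER BY timestamp DESC
""")
events.show(truncate=False)
```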
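
For step 3, cluster lifecycle events (resizes, terminations, driver health) can also be pulled with the Clusters API. A sketch against the /api/2.0/clusters/events endpoint; host and token are placeholders:

```python
# List recent events for the affected cluster via the REST API.
import requests

host = "https://<your-workspace>.azuredatabricks.net"  # placeholder
token = "<personal-access-token>"                      # placeholder

resp = requests.post(
    f"{host}/api/2.0/clusters/events",
    headers={"Authorization": f"Bearer {token}"},
    json={"cluster_id": "1030-205818-yu28ft9s", "limit": 50},
)
resp.raise_for_status()
for event in resp.json().get("events", []):
    print(event["timestamp"], event["type"])
```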

Here are some possible fixes you can implement (illustrative configuration sketches follow the table).

Root Cause → Mitigation

  • Driver overload → Use a larger driver; tune memory configs
  • Transient network loss → Enable retry logic in the job or pipeline
  • Auto-termination wake-up → Keep the cluster warm
  • Long DLT deployments → Separate deployment from execution
  • Azure transient failures → Retry, or contact Databricks support if frequent
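
For the retry mitigation, here's a sketch that adds task-level retries through the Jobs 2.1 API, so a transient driver loss is retried instead of failing the run. The job ID and task key are placeholders, and note that jobs/update replaces the fields you send:

```python
# Add automatic retries to the job task that triggers the pipeline.
import requests

host = "https://<your-workspace>.azuredatabricks.net"  # placeholder
token = "<personal-access-token>"                      # placeholder

resp = requests.post(
    f"{host}/api/2.1/jobs/update",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "job_id": 123,  # placeholder job ID
        "new_settings": {
            "tasks": [{
                "task_key": "run_dlt_pipeline",       # placeholder task key
                "max_retries": 2,                     # retry up to twice
                "min_retry_interval_millis": 300000,  # wait 5 min between tries
            }]
        },
    },
)
resp.raise_for_status()
```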
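
For the driver-overload and keep-warm mitigations, the relevant knobs are the driver node type and the auto-termination window. A sketch of the settings (node types and values are examples only; for a DLT pipeline these belong under the pipeline's own cluster settings rather than a standalone cluster):

```python
# Illustrative settings: a larger driver node and a longer
# auto-termination window to keep the cluster warm.
cluster_settings = {
    "cluster_id": "1030-205818-yu28ft9s",
    "spark_version": "15.4.x-scala2.12",        # example runtime
    "node_type_id": "Standard_D8ds_v5",         # example worker size
    "driver_node_type_id": "Standard_E8ds_v5",  # larger driver (example)
    "autotermination_minutes": 120,             # keep the cluster warm longer
    "num_workers": 4,
}
# Apply with POST {host}/api/2.0/clusters/edit, using the same auth
# headers as the sketches above.
```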


2 REPLIES

AbhaySingh
Databricks Employee

