08-23-2023 01:30 AM
Hello,
I have a problem with the autoscaling of a cluster. Every time the autoscaling is activated I get this error. Does anyone have any idea why this could be?
"Cluster xxxxxxx was terminated during the run (cluster state message: Lost communication with the driver node. This can occur because of networking errors or malfunctioning instances. databricks_error_message: driver is lost) "
From time to time I also get this error:
Cluster xxxxxx was terminated during the run (cluster state message: Setting up 6 nodes.)
08-30-2023 03:05 AM
I was able to look deeper into the logs and found this:
CPU is not the problem.
Caused by: com.databricks.backend.manager.instance.FirewallSetupException: Fail to setup inbound Firewall.
I got this error while autoscaling was ON. It must be something with my network, but I'm not sure what.
03-09-2025 01:20 AM
Hello Databricks Community,
It looks like your cluster is being terminated due to a lost connection with the driver node, which could be caused by network instability or malfunctioning instances. The second error message suggests that the cluster is being terminated while scaling up, possibly due to resource allocation issues.
Here are a few things you can check:
Let me know if you need more help troubleshooting. Kindly take this thread seriously!
03-10-2025 04:10 AM
Hello Databricks Community,
The error message indicates that the driver node was lost, which can happen due to network issues or malfunctioning instances. Here are a few possible reasons and solutions:
Let me know if you need further assistance.
Best regards!!
2 weeks ago - last edited 2 weeks ago
A "Cluster xxxxxxx was terminated during the run" message usually means the system stopped your cluster because it ran out of resources, hit an inactivity timeout, or encountered a critical error. This can happen when a job exceeds memory limits, the compute environment shuts down unexpectedly, or the platform automatically terminates idle clusters. Restarting the cluster, reviewing resource settings, and checking logs for failure points can help prevent the issue from occurring again.
2 weeks ago
The FirewallSetupException is thrown when Cluster Manager tries to allow communication to newly launched containers and the node can't apply updated iptables rules. This occurs in the code path for allowCommunicationFromOldHostsToNewContainers during add-containers/upsize operations.
A very common underlying cause is the node daemon failing to write the temporary firewall rule file due to "No space left on device," which prevents iptables-restore from applying the rules.
The instance's root volume is full (often due to archived log-daemon usage logs under /home/ubuntu/databricks/log-daemon/work/...), leading to "No space left on device" during firewall rule generation and apply.
Node daemon RPC failures (e.g., "Got invalid response: 404") from the instance can also cause inbound firewall updates to fail.
In the same window you'll often see cluster events like "Could not register new workers with running worker …" as the upsize/add-containers retries time out.
Retry the upsize or restart the cluster to replace the affected instances with fresh VMs, which typically clears local disk/log conditions.
If you can reach the instance, quickly check disk pressure:
Run df -h and look for the root device (e.g., /dev/xvda1) at 100% usage.
Shorten Spark event log rollover to reduce pressure on local storage during long-running jobs or noisy clusters, e.g.: spark.databricks.eventLog.rolloverIntervalSeconds=300.
Cleaning up oversized log-daemon archives on the affected host(s) restores autoscaling and allows firewall rule updates to succeed.
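If it is easier to run the disk check from a notebook cell or script than from a shell, something like the following might do. This is only a minimal sketch using Python's standard library; the function names and the 95% threshold are my own assumptions, not anything Databricks ships, and it inspects whichever machine the code runs on (on a cluster, that is normally the driver).

```python
import shutil


def disk_usage_percent(path="/"):
    """Approximate the Use% column of `df` for the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    # df reports used / (used + available), which ignores reserved blocks.
    denominator = usage.used + usage.free
    if denominator == 0:
        return 0.0
    return 100.0 * usage.used / denominator


def is_disk_full(path="/", threshold_percent=95.0):
    """Return True when usage is at or above the (assumed) threshold."""
    return disk_usage_percent(path) >= threshold_percent


if __name__ == "__main__":
    pct = disk_usage_percent("/")
    state = "under pressure" if is_disk_full("/") else "ok"
    print(f"root filesystem: {pct:.1f}% used ({state})")
```

A result close to 100% on the root filesystem would be consistent with the "No space left on device" cause described above, but it does not prove it.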
yesterday
Hello Databricks Community,
The driver node was lost, which might occur as a result of network problems or malfunctioning instances, according to the error message. Here are some potential causes and remedies:
Instance Instability: If the instance type you are using proves unstable on your cloud provider, consider switching to a different instance type.
Networking Problems: Make sure your VPC and security group settings allow consistent connectivity between nodes.
Autoscaling Interruption: Aggressive autoscaling can occasionally destabilize the driver. Try adjusting the scaling parameters.
Databricks Event History & Logs: To learn more about why the driver was lost, review the cluster event log and driver logs in Databricks.
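As a rough sketch of that log review: if you export the cluster's events as a list of dictionaries, a small filter can surface the entries around the driver loss. The dictionary shape (`type`, `timestamp`) and the event type names below are assumptions based on how I understand the cluster event log to look, and the sample data is made up, so check them against your own workspace.

```python
# Event types that often accompany a lost driver or a failed resize.
# These names are assumptions; adjust them to what your event log shows.
SUSPECT_TYPES = frozenset(
    {"DRIVER_NOT_RESPONDING", "DRIVER_UNAVAILABLE", "NODES_LOST", "TERMINATING"}
)


def suspicious_events(events, suspect_types=SUSPECT_TYPES):
    """Return events whose type is in suspect_types, oldest first."""
    hits = [e for e in events if e.get("type") in suspect_types]
    return sorted(hits, key=lambda e: e.get("timestamp", 0))


if __name__ == "__main__":
    # Hypothetical sample; real events carry more fields.
    sample_events = [
        {"type": "RESIZING", "timestamp": 1692754200000},
        {"type": "DRIVER_NOT_RESPONDING", "timestamp": 1692754260000},
        {"type": "TERMINATING", "timestamp": 1692754320000},
    ]
    for event in suspicious_events(sample_events):
        print(event["timestamp"], event["type"])
```

Seeing a resize event shortly before a driver event would suggest the autoscaling step is involved, which matches what the original poster reported.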
Please let me know if you require any other help.
Best regards!!