Cluster xxxxxxx was terminated during the run.

Eduard — Wed, 23 Aug 2023 08:30:35 GMT

Hello,

I have a problem with cluster autoscaling. Every time autoscaling kicks in, I get the error below. Does anyone have any idea why this happens?

"Cluster xxxxxxx was terminated during the run (cluster state message: Lost communication with the driver node. This can occur because of networking errors or malfunctioning instances. databricks_error_message: driver is lost)"

From time to time I also get this error:

Cluster xxxxxx was terminated during the run (cluster state message: Setting up 6 nodes.)

Re: Cluster xxxxxxx was terminated during the run.

Eduard — Wed, 30 Aug 2023 10:05:10 GMT

I was able to dig deeper into the logs and found this:

CPU is not the problem.

Caused by: com.databricks.backend.manager.instance.FirewallSetupException: Fail to setup inbound Firewall.

I got this error while autoscaling was ON. It must be something in my network, but I'm not sure what.

Re: Cluster xxxxxxx was terminated during the run.

louisgarza — Sun, 09 Mar 2025 09:20:29 GMT

Hello Databricks Community,

It looks like your cluster is being terminated due to a lost connection with the driver node, which could be caused by network instability or malfunctioning instances. The second error message suggests that the cluster is being terminated while scaling up, possibly due to resource allocation issues.

Here are a few things you can check:

Cluster Logs – Review the logs in Databricks to see if there are more specific error messages.
Cloud Provider Limits – Ensure that your cloud provider is not enforcing limits on the number of instances you can allocate.
Networking Issues – Check your VPC settings, security groups, and firewall rules to ensure there are no restrictions on communication between nodes.
Instance Availability – Sometimes, cloud providers have shortages of specific instance types, which can cause scaling issues. Try using different instance types.
Databricks Support – If the issue persists, consider reaching out to Databricks support with your cluster ID and logs for further investigation.

Let me know if you need more help troubleshooting.

Re: Cluster xxxxxxx was terminated during the run.

louisgarza — Mon, 10 Mar 2025 11:10:00 GMT

Hello Databricks Community,

The error message indicates that the driver node was lost, which can happen due to network issues or malfunctioning instances. Here are a few possible reasons and solutions:

Instance Instability: If the instance type you are using turns out to be unstable, try a different instance type.
Networking Issues: Ensure your VPC and security group settings allow stable communication between nodes.
Autoscaling Interruption: Sometimes, aggressive autoscaling can cause driver instability. Try adjusting the scaling settings.
Databricks Logs & Event History: Check the logs in the Databricks event timeline for more details on why the driver was lost.

Let me know if you need further assistance.

Best regards!!

Re: Cluster xxxxxxx was terminated during the run.

denny492 — Wed, 26 Nov 2025 07:06:05 GMT

A “Cluster xxxxxxx was terminated during the run” message usually means the system stopped your cluster because it ran out of resources, hit an inactivity timeout, or encountered a critical error. This can happen when a job exceeds memory limits, the compute environment shuts down unexpectedly, or the platform automatically terminates idle clusters. Restarting the cluster, reviewing resource settings, and checking logs for failure points can help prevent the issue from occurring again.

Re: Cluster xxxxxxx was terminated during the run.

iyashk-DB — Wed, 26 Nov 2025 18:03:14 GMT

The FirewallSetupException is thrown when Cluster Manager tries to allow communication to newly launched containers and the node can’t apply updated iptables rules. This occurs in the code path for allowCommunicationFromOldHostsToNewContainers during add-containers/upsize operations.

A very common underlying cause is the node daemon failing to write the temporary firewall rule file due to “No space left on device,” which prevents iptables-restore from applying the rules.

Common root causes seen

The instance’s root volume is full (often due to archived log-daemon usage logs under /home/ubuntu/databricks/log-daemon/work/...), leading to “No space left on device” during firewall rule generation and apply.
Node daemon RPC failures (e.g., “Got invalid response: 404”) from the instance can also cause inbound firewall updates to fail.
In the same window you’ll often see cluster events like “Could not register new workers with running worker …” as the upsize/add-containers retries time out.
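
To make it easier to spot these signatures in exported driver or cluster logs, here is a minimal Python sketch. The patterns are only the strings quoted in this thread; the function name, the category labels, and the sample lines are made up for illustration and are not part of any Databricks tooling.

```python
import re

# Error strings quoted in this thread, grouped under illustrative labels.
SIGNATURES = {
    "firewall_setup": re.compile(r"FirewallSetupException"),
    "disk_full": re.compile(r"No space left on device"),
    "node_daemon_rpc": re.compile(r"Got invalid response: \d+"),
    "worker_registration": re.compile(r"Could not register new workers"),
    "driver_lost": re.compile(r"driver is lost|Lost communication with the driver node"),
}


def classify_log_lines(lines):
    """Return {label: [matching lines]} for every signature that appears."""
    hits = {}
    for line in lines:
        for label, pattern in SIGNATURES.items():
            if pattern.search(line):
                hits.setdefault(label, []).append(line.rstrip("\n"))
    return hits


if __name__ == "__main__":
    # Synthetic sample lines, for illustration only.
    sample = [
        "Caused by: com.databricks.backend.manager.instance.FirewallSetupException: Fail to setup inbound Firewall.",
        "write failed: No space left on device",
        "INFO some unrelated line",
    ]
    for label, matched in classify_log_lines(sample).items():
        print(label, len(matched))
```

If "firewall_setup" and "disk_full" show up in the same time window, that points toward the full-root-volume cause described above rather than a VPC or security group problem.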

What you can do now (quick mitigation)

Retry the upsize or restart the cluster to replace the affected instances with fresh VMs, which typically clears local disk/log conditions.
If you can reach the instance, quickly check disk pressure:
- Run df -h and look for the root device (e.g., /dev/xvda1) at 100% usage.
- If confirmed, reduce/cleanup oversized log archives on that host or replace the instance.
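
If a shell is awkward to get but you can run Python on the host, the same disk-pressure check can be sketched with the standard library. This is only a rough equivalent of df -h (df computes its percentage slightly differently because of reserved blocks), and the directory you pass to largest_files is up to you; the helper names are made up for this example.

```python
import os
import shutil


def disk_usage_percent(path="/"):
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def largest_files(root, top_n=10):
    """Return up to `top_n` (size_in_bytes, path) pairs under `root`, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                sizes.append((os.path.getsize(full), full))
            except OSError:
                # File vanished or is unreadable; skip it.
                continue
    return sorted(sizes, reverse=True)[:top_n]


if __name__ == "__main__":
    print(f"root volume in use: {disk_usage_percent('/'):.1f}%")
    # Point this at the log directory you suspect, e.g. the log-daemon work dir.
    for size, path in largest_files(os.getcwd(), top_n=5):
        print(f"{size / 1e6:10.1f} MB  {path}")
```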

Longer-term mitigations:

Shorten Spark event log rollover to reduce pressure on local storage during long-running jobs or noisy clusters, e.g.:
- spark.databricks.eventLog.rolloverIntervalSeconds=300
Cleaning up oversized log-daemon archives on the affected host(s) restores autoscaling and allows firewall rule updates to succeed.
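
As a rough sketch of how the rollover setting could be carried around in code, the snippet below keeps it in a dict (the shape a spark_conf block in a cluster definition uses) and renders it as one "key value" pair per line. The key and value are taken from this post as-is and have not been checked against the Databricks docs; the space-separated rendering is an assumption about the cluster Spark config text box.

```python
# Key and value copied from the post above; treat them as unverified.
spark_conf = {
    "spark.databricks.eventLog.rolloverIntervalSeconds": "300",
}


def to_spark_config_lines(conf):
    """Render a conf dict as one 'key value' pair per line, sorted by key."""
    return "\n".join(f"{key} {value}" for key, value in sorted(conf.items()))


print(to_spark_config_lines(spark_conf))
```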

Re: Cluster xxxxxxx was terminated during the run.

marykline — Sat, 06 Dec 2025 12:16:04 GMT

Hello Databricks Community,

According to the error message, the driver node was lost, which can happen because of network problems or malfunctioning instances. Here are some potential causes and remedies:

Instance Instability: If the instances you are using prove unstable, consider switching to a different instance type.
Networking Problems: Make sure your VPC and security group settings allow reliable connectivity between nodes.
Autoscaling Interruption: Aggressive autoscaling can occasionally destabilize the driver. Try changing the scaling parameters.
Databricks Event History & Logs: To learn why the driver was lost, check the logs and the cluster event timeline in Databricks.

Please let me know if you require any other help.

Best regards!!

Re: Cluster xxxxxxx was terminated during the run.

joshhazel456 — Sun, 22 Feb 2026 05:18:00 GMT

Ensure the driver node is not using spot/preemptible instances, as they can terminate unexpectedly.

Increase the driver node size (more RAM/CPU) to prevent out-of-memory crashes.

Check the driver logs to identify memory, JVM, or networking errors.

Verify your cloud instance quota limits to confirm enough nodes can be provisioned.

Make sure the requested instance type is available in your selected availability zone.

Confirm your subnet has enough free IP addresses for scaling workers.

Review VPC, firewall, and security group rules to allow internal cluster communication.

Avoid aggressive autoscaling (e.g., scaling from 1 to many nodes instantly).

Set a reasonable minimum worker count to reduce cold-start failures.

Use on-demand instances for the driver for better stability.

Monitor cluster metrics (CPU, memory, network) during scaling events.

Test autoscaling with a smaller max node limit to isolate the issue.
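
A few of the points above (on-demand driver, a sensible minimum worker count, a smaller max, enough subnet IPs) can be sketched in Python. This assumes an AWS workspace and the field names used by the Databricks Clusters API (autoscale, aws_attributes.first_on_demand, driver_node_type_id); the cluster name and the angle-bracket placeholders are hypothetical, and the default of two IP addresses per node is an assumption to verify for your own deployment.

```python
import ipaddress
import json


def max_nodes_for_subnet(cidr, ips_per_node=2, reserved=5):
    """Upper bound on nodes an otherwise empty subnet could hold.

    reserved=5 matches what AWS holds back in every subnet; ips_per_node=2
    is an assumption about how many addresses each cluster node consumes.
    """
    total = ipaddress.ip_network(cidr).num_addresses
    return max(total - reserved, 0) // ips_per_node


def build_cluster_spec(min_workers, max_workers):
    """Build a conservative autoscaling cluster definition as a plain dict."""
    if not 1 <= min_workers <= max_workers:
        raise ValueError("need 1 <= min_workers <= max_workers")
    return {
        "cluster_name": "autoscaling-test",            # hypothetical name
        "spark_version": "<runtime-version>",          # placeholder
        "node_type_id": "<worker-instance-type>",      # placeholder
        "driver_node_type_id": "<driver-instance-type>",  # placeholder, size up if the driver runs out of memory
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        "aws_attributes": {
            # With first_on_demand >= 1 the driver is placed on an on-demand instance.
            "first_on_demand": 1,
            "availability": "SPOT_WITH_FALLBACK",
        },
    }


if __name__ == "__main__":
    print("nodes a /24 could hold:", max_nodes_for_subnet("10.0.0.0/24"))
    print(json.dumps(build_cluster_spec(min_workers=2, max_workers=6), indent=2))
```

The subnet estimate ignores addresses already taken by other resources, so treat it as a ceiling, not a guarantee.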