Data Engineering

Cluster xxxxxxx was terminated during the run.

Eduard
New Contributor II

Hello,

I have a problem with the autoscaling of a cluster. Every time autoscaling is triggered, I get this error. Does anyone have any idea why this could be?

"Cluster xxxxxxx was terminated during the run (cluster state message: Lost communication with the driver node. This can occur because of networking errors or malfunctioning instances. databricks_error_message: driver is lost)"

From time to time I also get this error:

"Cluster xxxxxx was terminated during the run (cluster state message: Setting up 6 nodes.)"

6 REPLIES

Eduard
New Contributor II

I was able to dig deeper into the logs and found this:

CPU is not the problem.

Caused by: com.databricks.backend.manager.instance.FirewallSetupException: Fail to setup inbound Firewall.

I got this error while autoscaling was ON. It must be something with my network, but I'm not sure what.

louisgarza
New Contributor II

Hello Databricks Community,

It looks like your cluster is being terminated due to a lost connection with the driver node, which could be caused by network instability or malfunctioning instances. The second error message suggests that the cluster is being terminated while scaling up, possibly due to resource allocation issues.

Here are a few things you can check:

  1. Cluster Logs – Review the logs in Databricks to see if there are more specific error messages.
  2. Cloud Provider Limits – Ensure that your cloud provider is not enforcing limits on the number of instances you can allocate.
  3. Networking Issues – Check your VPC settings, security groups, and firewall rules to ensure there are no restrictions on communication between nodes.
  4. Instance Availability – Sometimes, cloud providers have shortages of specific instance types, which can cause scaling issues. Try using different instance types.
  5. Databricks Support – If the issue persists, consider reaching out to Databricks support with your cluster ID and logs for further investigation.
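
For the "Cluster Logs" step, the cluster event history can also be pulled programmatically instead of through the UI. Below is a minimal sketch that only builds the HTTP request for the Clusters API events endpoint (`POST /api/2.0/clusters/events`); it does not send anything. The function name, the example host, the token placeholder, and the chosen event types are assumptions for illustration, so check them against the API reference for your workspace before relying on them.

```python
import json
import urllib.request


def build_events_request(host, token, cluster_id, event_types=None, limit=50):
    """Build (but do not send) a request for a cluster's event history.

    host, token and cluster_id are placeholders supplied by the caller.
    event_types is an optional filter; the names must match the API's enum.
    """
    body = {"cluster_id": cluster_id, "order": "DESC", "limit": limit}
    if event_types:
        body["event_types"] = list(event_types)
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.0/clusters/events",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical values; replace with your own workspace URL, token and cluster ID.
request = build_events_request(
    host="https://example.cloud.databricks.com",
    token="<personal-access-token>",
    cluster_id="xxxxxxx",
    event_types=["TERMINATING", "NODES_LOST", "DRIVER_UNAVAILABLE"],
)

# To actually send it (requires network access and a valid token):
# with urllib.request.urlopen(request) as response:
#     events = json.load(response).get("events", [])
```

Sorting with `order: DESC` puts the most recent events first, which is usually where the termination reason shows up.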

Let me know if you need more help troubleshooting. Kindly take this thread seriously!

louisgarza
New Contributor II

Hello Databricks Community,

The error message indicates that the driver node was lost, which can happen due to network issues or malfunctioning instances. Here are a few possible reasons and solutions:

  1. Instance Instability: If your cloud provider has unstable instances, try using a different instance type.
  2. Networking Issues: Ensure your VPC and security group settings allow stable communication between nodes.
  3. Autoscaling Interruption: Sometimes, aggressive autoscaling can cause driver instability. Try adjusting the scaling settings.
  4. Databricks Logs & Event History: Check the logs in the Databricks event timeline for more details on why the driver was lost.


Let me know if you need further assistance.

Best regards!!

denny492
New Contributor II

A "Cluster xxxxxxx was terminated during the run" message usually means the system stopped your cluster because it ran out of resources, hit an inactivity timeout, or encountered a critical error. This can happen when a job exceeds memory limits, the compute environment shuts down unexpectedly, or the platform automatically terminates idle clusters. Restarting the cluster, reviewing resource settings, and checking logs for failure points can help prevent the issue from recurring.

iyashk-DB
Databricks Employee

The FirewallSetupException is thrown when Cluster Manager tries to allow communication to newly launched containers and the node can't apply updated iptables rules. This occurs in the code path for allowCommunicationFromOldHostsToNewContainers during add-containers/upsize operations.

A very common underlying cause is the node daemon failing to write the temporary firewall rule file due to "No space left on device," which prevents iptables-restore from applying the rules.

Common root causes seen

  • The instance's root volume is full (often due to archived log-daemon usage logs under /home/ubuntu/databricks/log-daemon/work/...), leading to "No space left on device" during firewall rule generation and apply.

  • Node daemon RPC failures (e.g., "Got invalid response: 404") from the instance can also cause inbound firewall updates to fail.

  • In the same window you'll often see cluster events like "Could not register new workers with running worker …" as the upsize/add-containers retries time out.
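
If you have the driver or cluster logs exported as text, a quick way to tell which of the causes above you are hitting is to search for the message fragments quoted in this thread. The following is a rough sketch, not an official tool: the function name and the short descriptions are my own, and the search strings are simply the fragments mentioned above.

```python
# Message fragments quoted in this thread, mapped to a short interpretation.
SIGNATURES = {
    "FirewallSetupException": "inbound firewall update failed during upsize",
    "No space left on device": "root volume on the node is likely full",
    "Got invalid response: 404": "node daemon RPC failure",
    "Could not register new workers": "upsize/add-containers retries timed out",
}


def classify_log_lines(lines):
    """Return {fragment: [1-based line numbers]} for every fragment found."""
    hits = {}
    for number, line in enumerate(lines, start=1):
        for fragment in SIGNATURES:
            if fragment in line:
                hits.setdefault(fragment, []).append(number)
    return hits


# Example using the exception line posted earlier in the thread.
sample = [
    "Caused by: com.databricks.backend.manager.instance."
    "FirewallSetupException: Fail to setup inbound Firewall.",
]
for fragment, line_numbers in classify_log_lines(sample).items():
    print(f"{fragment} (lines {line_numbers}): {SIGNATURES[fragment]}")
```

Seeing "No space left on device" close to the FirewallSetupException points to the full-disk cause. A 404 from the node daemon points to the RPC failure case instead.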

What you can do now (quick mitigation)

  • Retry the upsize or restart the cluster to replace the affected instances with fresh VMs, which typically clears local disk/log conditions.

  • If you can reach the instance, quickly check disk pressure:

    • Run df -h and look for the root device (e.g., /dev/xvda1) at 100% usage.
    • If confirmed, reduce or clean up oversized log archives on that host, or replace the instance.
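
The same disk check can be done from Python on the node, for example from a notebook cell running on the driver or from an init script. This is a minimal sketch using only the standard library; the function names and the 95% threshold are arbitrary choices of mine, not Databricks settings.

```python
import shutil


def disk_usage_percent(path="/"):
    """Approximate the Use% column of `df` for the filesystem holding path."""
    usage = shutil.disk_usage(path)
    denominator = usage.used + usage.free
    if denominator == 0:
        return 0.0
    # df reports used / (used + available), which excludes reserved blocks.
    return 100.0 * usage.used / denominator


def is_under_disk_pressure(path="/", threshold_percent=95.0):
    """True when the filesystem is at or above the given usage threshold."""
    return disk_usage_percent(path) >= threshold_percent


print(f"Root volume usage: {disk_usage_percent('/'):.1f}%")
if is_under_disk_pressure("/"):
    print("Root volume is nearly full; firewall rule updates may fail.")
```

Note that this only inspects the machine the code runs on, so on a multi-node cluster it tells you about the driver unless you run it on each worker.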

Mitigations:

  • Shorten Spark event log rollover to reduce pressure on local storage during long-running jobs or noisy clusters, e.g.:

    • spark.databricks.eventLog.rolloverIntervalSeconds=300
  • Cleaning up oversized log-daemon archives on the affected host(s) restores autoscaling and allows firewall rule updates to succeed.
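
If you manage clusters through the API or as JSON specs, the rollover setting above goes under the `spark_conf` field of the cluster specification. Here is a small sketch of adding it to an existing spec without touching the other settings. The helper name and the example spec values (worker counts) are hypothetical; only the config key and the value 300 come from the reply above.

```python
import json


def with_event_log_rollover(cluster_spec, interval_seconds=300):
    """Return a copy of cluster_spec with the event log rollover interval set."""
    updated = dict(cluster_spec)
    spark_conf = dict(updated.get("spark_conf", {}))
    # Spark conf values are passed as strings in the cluster spec.
    spark_conf["spark.databricks.eventLog.rolloverIntervalSeconds"] = str(
        interval_seconds
    )
    updated["spark_conf"] = spark_conf
    return updated


# Hypothetical autoscaling spec; real specs carry more fields.
spec = {"autoscale": {"min_workers": 2, "max_workers": 6}}
print(json.dumps(with_event_log_rollover(spec), indent=2))
```

The helper copies the spec rather than mutating it, so the original stays available for comparison if the change needs to be rolled back.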

marykline
New Contributor

Hello Databricks Community,

According to the error message, the driver node was lost, which can happen because of network problems or malfunctioning instances. Here are some potential causes and remedies:

  1. Instance Instability: Consider switching to a different instance type if the instances from your cloud provider are unstable.
  2. Networking Problems: Make sure your VPC and security group settings allow stable connectivity between nodes.
  3. Autoscaling Interruption: Aggressive autoscaling can occasionally cause driver instability. Try changing the scaling parameters.
  4. Databricks Event History & Logs: To learn more about why the driver was lost, view the logs in the Databricks event timeline.

Please let me know if you require any other help.

Best regards!!