Data Engineering

Cluster xxxxxxx was terminated during the run.

Eduard
New Contributor II

Hello,

I have a problem with the autoscaling of a cluster. Every time the autoscaling is activated I get this error. Does anyone have any idea why this could be?

"Cluster xxxxxxx was terminated during the run (cluster state message: Lost communication with the driver node. This can occur because of networking errors or malfunctioning instances. databricks_error_message: driver is lost)"

From time to time I also get this error:

Cluster xxxxxx was terminated during the run (cluster state message: Setting up 6 nodes.)

3 REPLIES

Eduard
New Contributor II

I was able to dig deeper into the logs and found this:

CPU is not the problem. 

Caused by: com.databricks.backend.manager.instance.FirewallSetupException: Fail to setup inbound Firewall.

I got this error while autoscaling was on. It must be something with my network, but I'm not sure what.

louisgarza
New Contributor II

Hello Databricks Community,

It looks like your cluster is being terminated due to a lost connection with the driver node, which could be caused by network instability or malfunctioning instances. The second error message suggests that the cluster is being terminated while scaling up, possibly due to resource allocation issues.

Here are a few things you can check:

  1. Cluster Logs – Review the logs in Databricks to see if there are more specific error messages.
  2. Cloud Provider Limits – Ensure that your cloud provider is not enforcing limits on the number of instances you can allocate.
  3. Networking Issues – Check your VPC settings, security groups, and firewall rules to ensure there are no restrictions on communication between nodes.
  4. Instance Availability – Sometimes, cloud providers have shortages of specific instance types, which can cause scaling issues. Try using different instance types.
  5. Databricks Support – If the issue persists, consider reaching out to Databricks support with your cluster ID and logs for further investigation.
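
For the log review in step 1, here is a rough Python sketch of pulling a cluster's recent events over the REST API and keeping only the ones that usually matter for a lost driver or a stalled upsize. It assumes the Clusters API events endpoint at `/api/2.1/clusters/events` and a personal access token; the function names, the choice of event types, and the `host`/`token`/`cluster_id` parameters are mine, so please check them against the API reference for your workspace before relying on this.

```python
import json
import urllib.request

# Event types worth a look when a driver is lost or an upsize stalls.
# This is an assumed subset of the types the Clusters API reports; extend as needed.
RELEVANT_TYPES = {
    "TERMINATING",
    "RESIZING",
    "UPSIZE_COMPLETED",
    "NODES_LOST",
    "DRIVER_NOT_RESPONDING",
    "DRIVER_UNAVAILABLE",
}


def fetch_cluster_events(host, token, cluster_id, limit=50):
    """POST to the cluster events endpoint and return the raw list of events."""
    payload = json.dumps(
        {"cluster_id": cluster_id, "order": "DESC", "limit": limit}
    ).encode("utf-8")
    request = urllib.request.Request(
        f"{host.rstrip('/')}/api/2.1/clusters/events",
        data=payload,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response).get("events", [])


def summarize_events(events):
    """Keep relevant events, oldest first, as (timestamp, type, reason_code)."""
    rows = []
    for event in events:
        if event.get("type") not in RELEVANT_TYPES:
            continue
        # Termination events carry a reason object; other events may not.
        reason = event.get("details", {}).get("reason", {})
        rows.append((event.get("timestamp", 0), event["type"], reason.get("code")))
    return sorted(rows, key=lambda row: row[0])
```

Calling `summarize_events(fetch_cluster_events(host, token, cluster_id))` gives a short timeline you can line up against the moment autoscaling started adding nodes, which is also handy to attach to a support ticket (step 5).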

Let me know if you need more help troubleshooting. Please treat this thread seriously.

louisgarza
New Contributor II

Hello Databricks Community,

The error message indicates that the driver node was lost, which can happen due to network issues or malfunctioning instances. Here are a few possible reasons and solutions:

  1. Instance Instability: If your cloud provider has unstable instances, try using a different instance type.
  2. Networking Issues: Ensure your VPC and security group settings allow stable communication between nodes.
  3. Autoscaling Interruption: Sometimes, aggressive autoscaling can cause driver instability. Try adjusting the scaling settings.
  4. Databricks Logs & Event History: Check the logs in the Databricks event timeline for more details on why the driver was lost.
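
For point 3, one way to test whether the failure is tied to scaling is to rerun the job with a fixed worker count, or with a narrower autoscale range. A minimal sketch, assuming your cluster definition is a Python dict in the shape the Clusters API uses (`num_workers` for a fixed size, or `autoscale` with `min_workers` and `max_workers`); the helper names are hypothetical and nothing here calls the API:

```python
import copy


def pin_worker_count(cluster_spec, workers):
    """Return a copy of the spec with autoscaling replaced by a fixed size."""
    spec = copy.deepcopy(cluster_spec)
    spec.pop("autoscale", None)  # a spec carries either autoscale or num_workers
    spec["num_workers"] = workers
    return spec


def narrow_autoscale(cluster_spec, min_workers, max_workers):
    """Return a copy of the spec with a tighter autoscale range."""
    if min_workers > max_workers:
        raise ValueError("min_workers must not exceed max_workers")
    spec = copy.deepcopy(cluster_spec)
    spec.pop("num_workers", None)
    spec["autoscale"] = {"min_workers": min_workers, "max_workers": max_workers}
    return spec
```

If the job runs cleanly with a pinned size but fails once the range is widened again, that points at node provisioning (for example the firewall setup error mentioned earlier in the thread) rather than at the workload itself.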


Let me know if you need further assistance.

Best regards!