<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Cluster xxxxxxx was terminated during the run. in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/cluster-xxxxxxx-was-terminated-during-the-run/m-p/140447#M51430</link>
    <description>&lt;P&gt;The FirewallSetupException is thrown when Cluster Manager tries to allow communication to newly launched containers and the node can’t apply updated iptables rules. This occurs in the code path for allowCommunicationFromOldHostsToNewContainers during add-containers/upsize operations.&lt;/P&gt;
&lt;P&gt;A very common underlying cause is the node daemon failing to write the temporary firewall rule file due to “No space left on device,” which prevents iptables-restore from applying the rules.&lt;/P&gt;
&lt;H3 class="_7uu25p0 qt3gz9c _7pq7t612 heading3 _7uu25p1"&gt;Common root causes seen&lt;/H3&gt;
&lt;UL class="qt3gz97 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;The instance’s &lt;STRONG&gt;root volume&lt;/STRONG&gt; is full (often due to archived log-daemon usage logs under &lt;CODE class="qt3gz9f"&gt;/home/ubuntu/databricks/log-daemon/work/...&lt;/CODE&gt;), leading to “No space left on device” errors while firewall rules are generated and applied.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;&lt;STRONG&gt;Node daemon RPC failures&lt;/STRONG&gt; (e.g., “Got invalid response: 404”) from the instance can also cause inbound firewall updates to fail.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;In the same window you’ll often see cluster events like “Could not register new workers with running worker …” as the upsize/add-containers retries time out.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3 class="_7uu25p0 qt3gz9c _7pq7t612 heading3 _7uu25p1"&gt;What you can do now (quick mitigation)&lt;/H3&gt;
&lt;UL class="qt3gz97 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;Retry the upsize or &lt;STRONG&gt;restart the cluster&lt;/STRONG&gt; to replace the affected instances with fresh VMs, which typically clears local disk/log conditions.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;If you can reach the instance, quickly check disk pressure:&lt;/P&gt;
&lt;UL class="qt3gz98 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;Run &lt;CODE class="qt3gz9f"&gt;df -h&lt;/CODE&gt; and look for the root device (e.g., &lt;CODE class="qt3gz9f"&gt;/dev/xvda1&lt;/CODE&gt;) at 100% usage.&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;If confirmed, clean up or shrink the oversized log archives on that host, or replace the instance.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3 class="_7uu25p0 qt3gz9c _7pq7t612 heading3 _7uu25p1"&gt;Mitigations&lt;/H3&gt;
&lt;UL class="qt3gz97 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;Shorten Spark &lt;STRONG&gt;event log rollover&lt;/STRONG&gt; to reduce pressure on local storage during long-running jobs or noisy clusters, e.g.:&lt;/P&gt;
&lt;UL class="qt3gz98 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;&lt;CODE class="qt3gz9f"&gt;spark.databricks.eventLog.rolloverIntervalSeconds=300&lt;/CODE&gt;.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;Cleaning up oversized &lt;STRONG&gt;log-daemon archives&lt;/STRONG&gt; on the affected host(s) restores autoscaling and allows firewall rule updates to succeed.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;</description>
    <pubDate>Wed, 26 Nov 2025 18:03:14 GMT</pubDate>
    <dc:creator>iyashk-DB</dc:creator>
    <dc:date>2025-11-26T18:03:14Z</dc:date>
    <item>
      <title>Cluster xxxxxxx was terminated during the run.</title>
      <link>https://community.databricks.com/t5/data-engineering/cluster-xxxxxxx-was-terminated-during-the-run/m-p/41119#M27296</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;I have a problem with cluster autoscaling. Every time autoscaling kicks in, I get this error. Does anyone have any idea why this could be?&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;"Cluster xxxxxxx was terminated during the run (cluster state message: Lost communication with the driver node. This can occur because of networking errors or malfunctioning instances. databricks_error_message: driver is lost) "&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;From time to time I also get this error:&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Cluster xxxxxx was terminated during the run (cluster state message: Setting up 6 nodes.)&lt;/STRONG&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 23 Aug 2023 08:30:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/cluster-xxxxxxx-was-terminated-during-the-run/m-p/41119#M27296</guid>
      <dc:creator>Eduard</dc:creator>
      <dc:date>2023-08-23T08:30:35Z</dc:date>
    </item>
    <item>
      <title>Re: Cluster xxxxxxx was terminated during the run.</title>
      <link>https://community.databricks.com/t5/data-engineering/cluster-xxxxxxx-was-terminated-during-the-run/m-p/42860#M27433</link>
      <description>&lt;P&gt;I was able to look deeper into the logs and found this:&lt;/P&gt;&lt;P&gt;CPU is not the problem.&lt;/P&gt;&lt;P&gt;Caused by: com.databricks.backend.manager.instance.FirewallSetupException: Fail to setup inbound Firewall.&lt;/P&gt;&lt;P&gt;I got this error while autoscaling was on. It must be something with my network, but I am not sure what.&lt;/P&gt;</description>
      <pubDate>Wed, 30 Aug 2023 10:05:10 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/cluster-xxxxxxx-was-terminated-during-the-run/m-p/42860#M27433</guid>
      <dc:creator>Eduard</dc:creator>
      <dc:date>2023-08-30T10:05:10Z</dc:date>
    </item>
    <item>
      <title>Re: Cluster xxxxxxx was terminated during the run.</title>
      <link>https://community.databricks.com/t5/data-engineering/cluster-xxxxxxx-was-terminated-during-the-run/m-p/112092#M44105</link>
      <description>&lt;P&gt;Hello &lt;STRONG&gt;Databricks Community&lt;/STRONG&gt;,&lt;/P&gt;&lt;P&gt;It looks like your cluster is being terminated due to a lost connection with the driver node, which could be caused by network instability or malfunctioning instances. The second error message suggests that the cluster is being terminated while scaling up, possibly due to resource allocation issues.&lt;/P&gt;&lt;P&gt;Here are a few things you can check:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;STRONG&gt;Cluster Logs&lt;/STRONG&gt; – Review the logs in Databricks to see if there are more specific error messages.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Cloud Provider Limits&lt;/STRONG&gt; – Ensure that your cloud provider is not enforcing limits on the number of instances you can allocate.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Networking Issues&lt;/STRONG&gt; – Check your VPC settings, security groups, and firewall rules to ensure there are no restrictions on communication between nodes.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Instance Availability&lt;/STRONG&gt; – Sometimes, cloud providers have shortages of specific instance types, which can cause scaling issues. Try using different instance types.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Databricks Support&lt;/STRONG&gt; – If the issue persists, consider reaching out to Databricks support with your cluster ID and logs for further investigation.&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;Let me know if you need more help troubleshooting. Kindly take this thread seriously!&lt;/P&gt;</description>
      <pubDate>Sun, 09 Mar 2025 09:20:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/cluster-xxxxxxx-was-terminated-during-the-run/m-p/112092#M44105</guid>
      <dc:creator>louisgarza</dc:creator>
      <dc:date>2025-03-09T09:20:29Z</dc:date>
    </item>
    <item>
      <title>Re: Cluster xxxxxxx was terminated during the run.</title>
      <link>https://community.databricks.com/t5/data-engineering/cluster-xxxxxxx-was-terminated-during-the-run/m-p/112148#M44121</link>
      <description>&lt;P&gt;Hello &lt;STRONG&gt;Databricks Community&lt;/STRONG&gt;,&lt;/P&gt;&lt;P&gt;The error message indicates that the driver node was lost, which can happen due to network issues or malfunctioning instances. Here are a few possible reasons and solutions:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;STRONG&gt;Instance Instability:&lt;/STRONG&gt; If your cloud provider has unstable instances, try using a different instance type.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Networking Issues:&lt;/STRONG&gt; Ensure your VPC and security group settings allow stable communication between nodes.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Autoscaling Interruption:&lt;/STRONG&gt; Sometimes, aggressive autoscaling can cause driver instability. Try adjusting the scaling settings.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Databricks Logs &amp;amp; Event History:&lt;/STRONG&gt; Check the logs in the &lt;STRONG&gt;Databricks event timeline&lt;/STRONG&gt; for more details on why the driver was lost.&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;Let me know if you need further assistance.&lt;/P&gt;&lt;P&gt;Best regards!!&lt;/P&gt;</description>
      <pubDate>Mon, 10 Mar 2025 11:10:00 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/cluster-xxxxxxx-was-terminated-during-the-run/m-p/112148#M44121</guid>
      <dc:creator>louisgarza</dc:creator>
      <dc:date>2025-03-10T11:10:00Z</dc:date>
    </item>
    <item>
      <title>Re: Cluster xxxxxxx was terminated during the run.</title>
      <link>https://community.databricks.com/t5/data-engineering/cluster-xxxxxxx-was-terminated-during-the-run/m-p/140371#M51403</link>
      <description>&lt;P&gt;A “Cluster xxxxxxx was terminated during the run” message usually means the system stopped your cluster because it ran out of resources, hit an inactivity timeout, or encountered a critical error. This can happen when a job exceeds memory limits, the compute environment shuts down unexpectedly, or the platform automatically terminates idle clusters. Restarting the cluster, reviewing resource settings, and checking logs for failure points can help prevent the issue from occurring again.&lt;/P&gt;</description>
      <pubDate>Wed, 26 Nov 2025 07:06:05 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/cluster-xxxxxxx-was-terminated-during-the-run/m-p/140371#M51403</guid>
      <dc:creator>denny492</dc:creator>
      <dc:date>2025-11-26T07:06:05Z</dc:date>
    </item>
    <item>
      <title>Re: Cluster xxxxxxx was terminated during the run.</title>
      <link>https://community.databricks.com/t5/data-engineering/cluster-xxxxxxx-was-terminated-during-the-run/m-p/140447#M51430</link>
      <description>&lt;P&gt;The FirewallSetupException is thrown when Cluster Manager tries to allow communication to newly launched containers and the node can’t apply updated iptables rules. This occurs in the code path for allowCommunicationFromOldHostsToNewContainers during add-containers/upsize operations.&lt;/P&gt;
&lt;P&gt;A very common underlying cause is the node daemon failing to write the temporary firewall rule file due to “No space left on device,” which prevents iptables-restore from applying the rules.&lt;/P&gt;
&lt;H3 class="_7uu25p0 qt3gz9c _7pq7t612 heading3 _7uu25p1"&gt;Common root causes seen&lt;/H3&gt;
&lt;UL class="qt3gz97 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;The instance’s &lt;STRONG&gt;root volume&lt;/STRONG&gt; is full (often due to archived log-daemon usage logs under &lt;CODE class="qt3gz9f"&gt;/home/ubuntu/databricks/log-daemon/work/...&lt;/CODE&gt;), leading to “No space left on device” errors while firewall rules are generated and applied.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;&lt;STRONG&gt;Node daemon RPC failures&lt;/STRONG&gt; (e.g., “Got invalid response: 404”) from the instance can also cause inbound firewall updates to fail.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;In the same window you’ll often see cluster events like “Could not register new workers with running worker …” as the upsize/add-containers retries time out.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
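&lt;P&gt;As a rough triage aid, the symptoms listed above can be matched against exported log text. The following is a minimal Python sketch, not a Databricks tool: the function names and category labels are hypothetical, and the only strings taken from this thread are the quoted error fragments.&lt;/P&gt;

```python
from collections import Counter
from typing import Iterable

# Substrings quoted in this thread, mapped to illustrative (hypothetical) labels.
SIGNATURES = {
    "FirewallSetupException": "inbound firewall setup failed",
    "No space left on device": "root volume full",
    "Got invalid response: 404": "node daemon RPC failure",
    "Could not register new workers": "upsize retries timing out",
}


def classify_line(line: str) -> list[str]:
    """Return the labels of every known signature found in one log line."""
    return [label for needle, label in SIGNATURES.items() if needle in line]


def summarize(lines: Iterable[str]) -> dict[str, int]:
    """Count how many log lines matched each label."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(classify_line(line))
    return dict(counts)


if __name__ == "__main__":
    # The exception line reported earlier in this thread.
    sample = [
        "Caused by: com.databricks.backend.manager.instance."
        "FirewallSetupException: Fail to setup inbound Firewall.",
    ]
    print(summarize(sample))
```

&lt;P&gt;If the “root volume full” label shows up alongside the firewall failure, the disk check in the next section is the natural follow-up.&lt;/P&gt;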
&lt;H3 class="_7uu25p0 qt3gz9c _7pq7t612 heading3 _7uu25p1"&gt;What you can do now (quick mitigation)&lt;/H3&gt;
&lt;UL class="qt3gz97 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;Retry the upsize or &lt;STRONG&gt;restart the cluster&lt;/STRONG&gt; to replace the affected instances with fresh VMs, which typically clears local disk/log conditions.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;If you can reach the instance, quickly check disk pressure:&lt;/P&gt;
&lt;UL class="qt3gz98 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;Run &lt;CODE class="qt3gz9f"&gt;df -h&lt;/CODE&gt; and look for the root device (e.g., &lt;CODE class="qt3gz9f"&gt;/dev/xvda1&lt;/CODE&gt;) at 100% usage.&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;If confirmed, clean up or shrink the oversized log archives on that host, or replace the instance.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;/UL&gt;
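&lt;P&gt;If shell access is available but you would rather script the check, the disk-pressure steps above can be approximated in Python. This is a sketch under assumptions: the helper names &lt;CODE class="qt3gz9f"&gt;disk_usage_percent&lt;/CODE&gt; and &lt;CODE class="qt3gz9f"&gt;largest_files&lt;/CODE&gt; are hypothetical, and the scanned directory is simply the log-daemon work directory mentioned earlier, which may not exist on every image.&lt;/P&gt;

```python
import os
import shutil


def disk_usage_percent(path: str = "/") -> float:
    """Used share of the filesystem holding `path`, in percent.

    Computed as used / (used + available), which is how df derives Use%.
    """
    usage = shutil.disk_usage(path)
    denominator = usage.used + usage.free
    if denominator == 0:
        return 0.0
    return 100.0 * usage.used / denominator


def largest_files(directory: str, top_n: int = 5) -> list[tuple[int, str]]:
    """Return up to `top_n` (size_in_bytes, path) pairs, largest first."""
    sizes = []
    # os.walk yields nothing for a missing or unreadable directory.
    for dirpath, _dirnames, filenames in os.walk(directory):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            try:
                sizes.append((os.lstat(full_path).st_size, full_path))
            except OSError:
                continue  # file vanished or is unreadable; skip it
    return sorted(sizes, reverse=True)[:top_n]


if __name__ == "__main__":
    print(f"root filesystem usage: {disk_usage_percent('/'):.1f}%")
    # Directory taken from the post above; adjust if your layout differs.
    for size, path in largest_files("/home/ubuntu/databricks/log-daemon/work"):
        print(f"{size / 1e6:10.1f} MB  {path}")
```

&lt;P&gt;A usage figure at or near 100% together with large archives in the listing matches the “No space left on device” cause described above.&lt;/P&gt;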
&lt;H3 class="_7uu25p0 qt3gz9c _7pq7t612 heading3 _7uu25p1"&gt;Mitigations&lt;/H3&gt;
&lt;UL class="qt3gz97 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;Shorten Spark &lt;STRONG&gt;event log rollover&lt;/STRONG&gt; to reduce pressure on local storage during long-running jobs or noisy clusters, e.g.:&lt;/P&gt;
&lt;UL class="qt3gz98 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;&lt;CODE class="qt3gz9f"&gt;spark.databricks.eventLog.rolloverIntervalSeconds=300&lt;/CODE&gt;.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;Cleaning up oversized &lt;STRONG&gt;log-daemon archives&lt;/STRONG&gt; on the affected host(s) restores autoscaling and allows firewall rule updates to succeed.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Wed, 26 Nov 2025 18:03:14 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/cluster-xxxxxxx-was-terminated-during-the-run/m-p/140447#M51430</guid>
      <dc:creator>iyashk-DB</dc:creator>
      <dc:date>2025-11-26T18:03:14Z</dc:date>
    </item>
    <item>
      <title>Re: Cluster xxxxxxx was terminated during the run.</title>
      <link>https://community.databricks.com/t5/data-engineering/cluster-xxxxxxx-was-terminated-during-the-run/m-p/141314#M51693</link>
      <description>&lt;P&gt;Hello &lt;STRONG&gt;Databricks Community&lt;/STRONG&gt;,&lt;/P&gt;&lt;P&gt;According to the error message, the driver node was lost, which can happen because of network problems or malfunctioning instances. Here are some potential causes and remedies:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;STRONG&gt;Instance Instability:&lt;/STRONG&gt; If the instances from your cloud provider are unstable, consider switching to a different instance type.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Networking Problems:&lt;/STRONG&gt; Make sure your VPC and security group settings allow consistent connectivity between nodes.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Autoscaling Interruption:&lt;/STRONG&gt; Aggressive autoscaling can occasionally cause driver instability. Try changing the scaling parameters.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Databricks Event History &amp;amp; Logs:&lt;/STRONG&gt; View the logs in the Databricks event timeline to learn more about why the driver was lost.&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;Please let me know if you require any other help.&lt;/P&gt;&lt;P&gt;Best regards!!&lt;/P&gt;</description>
      <pubDate>Sat, 06 Dec 2025 12:16:04 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/cluster-xxxxxxx-was-terminated-during-the-run/m-p/141314#M51693</guid>
      <dc:creator>marykline</dc:creator>
      <dc:date>2025-12-06T12:16:04Z</dc:date>
    </item>
    <item>
      <title>Re: Cluster xxxxxxx was terminated during the run.</title>
      <link>https://community.databricks.com/t5/data-engineering/cluster-xxxxxxx-was-terminated-during-the-run/m-p/148966#M52998</link>
      <description>&lt;P&gt;Ensure the driver node is not using spot/preemptible instances, as they can terminate unexpectedly.&lt;/P&gt;&lt;P&gt;Increase the driver node size (more RAM/CPU) to prevent out-of-memory crashes.&lt;/P&gt;&lt;P&gt;Check the driver logs to identify memory, JVM, or networking errors.&lt;/P&gt;&lt;P&gt;Verify your cloud instance quota limits to confirm enough nodes can be provisioned.&lt;/P&gt;&lt;P&gt;Make sure the requested instance type is available in your selected availability zone.&lt;/P&gt;&lt;P&gt;Confirm your subnet has enough free IP addresses for scaling workers.&lt;/P&gt;&lt;P&gt;Review VPC, firewall, and security group rules to allow internal cluster communication.&lt;/P&gt;&lt;P&gt;Avoid aggressive autoscaling (e.g., scaling from 1 to many nodes instantly).&lt;/P&gt;&lt;P&gt;Set a reasonable minimum worker count to reduce cold-start failures.&lt;/P&gt;&lt;P&gt;Use on-demand instances for the driver for better stability.&lt;/P&gt;&lt;P&gt;Monitor cluster metrics (CPU, memory, network) during scaling events.&lt;/P&gt;&lt;P&gt;Test autoscaling with a smaller max node limit to isolate the issue.&lt;/P&gt;</description>
      <pubDate>Sun, 22 Feb 2026 05:18:00 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/cluster-xxxxxxx-was-terminated-during-the-run/m-p/148966#M52998</guid>
      <dc:creator>joshhazel456</dc:creator>
      <dc:date>2026-02-22T05:18:00Z</dc:date>
    </item>
  </channel>
</rss>

