<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Executors Getting FORCE_KILL After Migration to GCE – Resource Scaling Not Helping in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/executors-getting-force-kill-after-migration-to-gce-resource/m-p/131698#M49196</link>
    <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/135091"&gt;@minhhung0507&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;&lt;P&gt;We're facing a persistent issue with our production streaming pipelines where executors are being &lt;STRONG&gt;forcefully killed&lt;/STRONG&gt; with the following error:&lt;/P&gt;&lt;PRE&gt;Executor got terminated abnormally due to FORCE_KILL&lt;/PRE&gt;&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;I solved the issue in our case, and I think I now know why it happened in the first place. In our DLT workload the amount of state information is unusually high (at least I think so) compared to the total volume of data we process. I have since learned that RocksDB (in which the state information is stored) operates outside the JVM, meaning it uses the worker's non-heap memory. If non-heap memory consumption climbs too high, the worker is simply killed by its OS.&lt;BR /&gt;&lt;BR /&gt;I changed several settings in spark.conf, so I can't tell you exactly which one solved our issue, or whether it was all of them in combination, but here is what I changed:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;allocated more non-heap memory to workers (see&amp;nbsp;&lt;A href="https://spark.apache.org/docs/latest/configuration.html" target="_blank" rel="noopener"&gt;Spark documentation&lt;/A&gt;)&lt;/LI&gt;&lt;LI&gt;limited RocksDB memory usage and tuned some other RocksDB-related settings&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;The settings I'm talking about in (2) can be found here:&lt;BR /&gt;&lt;A href="https://aws.amazon.com/blogs/big-data/rocksdb-101-optimizing-stateful-streaming-in-apache-spark-with-amazon-emr-and-aws-glue/" target="_blank" rel="noopener"&gt;rocksdb-101-optimizing-stateful-streaming-in-apache-spark-with-amazon-emr-and-aws-glue&lt;/A&gt;&amp;nbsp;&lt;BR /&gt;I found this resource extremely helpful, both for the explanations it provides and for the suggested "defaults".&lt;/P&gt;</description>
    <pubDate>Thu, 11 Sep 2025 18:38:42 GMT</pubDate>
    <dc:creator>thomas-totter</dc:creator>
    <dc:date>2025-09-11T18:38:42Z</dc:date>
    <item>
      <title>Executors Getting FORCE_KILL After Migration to GCE – Resource Scaling Not Helping</title>
      <link>https://community.databricks.com/t5/data-engineering/executors-getting-force-kill-after-migration-to-gce-resource/m-p/122784#M46868</link>
      <description>&lt;P&gt;Hi everyone,&lt;/P&gt;&lt;P&gt;We're facing a persistent issue with our production streaming pipelines where executors are being &lt;STRONG&gt;forcefully killed&lt;/STRONG&gt; with the following error:&lt;/P&gt;&lt;PRE&gt;Executor got terminated abnormally due to FORCE_KILL&lt;/PRE&gt;&lt;P&gt;&lt;span class="lia-unicode-emoji" title=":pushpin:"&gt;📌&lt;/span&gt; &lt;STRONG&gt;Screenshot for reference:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="minhhung0507_1-1750841419951.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/17755i8FCD782D1409C6A8/image-size/medium?v=v2&amp;amp;px=400" role="button" title="minhhung0507_1-1750841419951.png" alt="minhhung0507_1-1750841419951.png" /&gt;&lt;/span&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="minhhung0507_2-1750841451453.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/17756i35C94681C465E68C/image-size/medium?v=v2&amp;amp;px=400" role="button" title="minhhung0507_2-1750841451453.png" alt="minhhung0507_2-1750841451453.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;H3&gt;&lt;STRONG&gt;Context:&lt;/STRONG&gt;&lt;/H3&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Our pipelines create streaming tables using Delta Live Tables.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;This issue &lt;STRONG&gt;only started happening after Databricks migrated from GKE to GCE&lt;/STRONG&gt;.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;We initially ran the job on &lt;STRONG&gt;2 workers with 16 cores each&lt;/STRONG&gt;, but due to failures, we tried scaling up gradually:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;3×16-core&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;2×32-core (equivalent to 4×16-core)&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;even tried 5×32-core workers.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Despite the 
aggressive scaling, &lt;STRONG&gt;executors still get force-killed&lt;/STRONG&gt;.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;When we monitor resource usage, we notice &lt;STRONG&gt;executors are only using ~70% CPU&lt;/STRONG&gt;, and &lt;STRONG&gt;the job is killed before even completing the first batch&lt;/STRONG&gt;.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;H3&gt;&lt;STRONG&gt;Questions:&lt;/STRONG&gt;&lt;/H3&gt;&lt;OL&gt;&lt;LI&gt;&lt;P&gt;Has anyone experienced similar behavior after the move to GCE?&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;What could be causing FORCE_KILL on executors that are not saturated (only ~70% CPU utilization)?&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Are there known configurations or cluster policies in GCE that could trigger such early termination?&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Could this be related to DLT’s retry policy or hidden limits at the infrastructure level?&lt;/P&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;Any insights or recommendations are greatly appreciated!&lt;/P&gt;&lt;P&gt;Thanks in advance.&lt;/P&gt;</description>
      <pubDate>Wed, 25 Jun 2025 08:51:19 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/executors-getting-force-kill-after-migration-to-gce-resource/m-p/122784#M46868</guid>
      <dc:creator>minhhung0507</dc:creator>
      <dc:date>2025-06-25T08:51:19Z</dc:date>
    </item>
    <item>
      <title>Re: Executors Getting FORCE_KILL After Migration to GCE – Resource Scaling Not Helping</title>
      <link>https://community.databricks.com/t5/data-engineering/executors-getting-force-kill-after-migration-to-gce-resource/m-p/127205#M47891</link>
      <description>&lt;P&gt;We have the exact same issue since very recently, but we are on Azure...&lt;/P&gt;</description>
      <pubDate>Sat, 02 Aug 2025 01:29:04 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/executors-getting-force-kill-after-migration-to-gce-resource/m-p/127205#M47891</guid>
      <dc:creator>thomas-totter</dc:creator>
      <dc:date>2025-08-02T01:29:04Z</dc:date>
    </item>
    <item>
      <title>Re: Executors Getting FORCE_KILL After Migration to GCE – Resource Scaling Not Helping</title>
      <link>https://community.databricks.com/t5/data-engineering/executors-getting-force-kill-after-migration-to-gce-resource/m-p/131698#M49196</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/135091"&gt;@minhhung0507&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;&lt;P&gt;We're facing a persistent issue with our production streaming pipelines where executors are being &lt;STRONG&gt;forcefully killed&lt;/STRONG&gt; with the following error:&lt;/P&gt;&lt;PRE&gt;Executor got terminated abnormally due to FORCE_KILL&lt;/PRE&gt;&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;I solved the issue in our case, and I think I now know why it happened in the first place. In our DLT workload the amount of state information is unusually high (at least I think so) compared to the total volume of data we process. I have since learned that RocksDB (in which the state information is stored) operates outside the JVM, meaning it uses the worker's non-heap memory. If non-heap memory consumption climbs too high, the worker is simply killed by its OS.&lt;BR /&gt;&lt;BR /&gt;I changed several settings in spark.conf, so I can't tell you exactly which one solved our issue, or whether it was all of them in combination, but here is what I changed:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;allocated more non-heap memory to workers (see&amp;nbsp;&lt;A href="https://spark.apache.org/docs/latest/configuration.html" target="_blank" rel="noopener"&gt;Spark documentation&lt;/A&gt;)&lt;/LI&gt;&lt;LI&gt;limited RocksDB memory usage and tuned some other RocksDB-related settings&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;The settings I'm talking about in (2) can be found here:&lt;BR /&gt;&lt;A href="https://aws.amazon.com/blogs/big-data/rocksdb-101-optimizing-stateful-streaming-in-apache-spark-with-amazon-emr-and-aws-glue/" target="_blank" rel="noopener"&gt;rocksdb-101-optimizing-stateful-streaming-in-apache-spark-with-amazon-emr-and-aws-glue&lt;/A&gt;&amp;nbsp;&lt;BR /&gt;I found this resource extremely helpful, both for the explanations it provides and for the suggested "defaults".&lt;/P&gt;</description>
      <pubDate>Thu, 11 Sep 2025 18:38:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/executors-getting-force-kill-after-migration-to-gce-resource/m-p/131698#M49196</guid>
      <dc:creator>thomas-totter</dc:creator>
      <dc:date>2025-09-11T18:38:42Z</dc:date>
    </item>
  </channel>
</rss>

