<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Executors getting killed while Scaling Spark jobs on GPU using RAPIDS(NVIDIA) in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/executors-getting-killed-while-scaling-spark-jobs-on-gpu-using/m-p/137332#M50725</link>
    <description>&lt;P&gt;hey&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/140212"&gt;@rajanchaturvedi&lt;/a&gt;&amp;nbsp;,&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Executor termination, especially when scaling a GPU-accelerated job, is almost always due to &lt;STRONG&gt;memory over-allocation&lt;/STRONG&gt; (Out Of Memory, or OOM) on the worker nodes, which causes the cluster manager to kill the process. This is exacerbated in GPU environments because two large memory spaces must be managed: the CPU (JVM) heap and the dedicated GPU memory.&lt;/P&gt;
&lt;P&gt;Can you please check and share the cause of the&amp;nbsp;&lt;SPAN&gt;executors being killed (for example, from the executor logs or the Spark UI)? Also, the job being stuck does not sound right. Are you certain nothing is running?&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Mon, 03 Nov 2025 02:50:03 GMT</pubDate>
    <dc:creator>NandiniN</dc:creator>
    <dc:date>2025-11-03T02:50:03Z</dc:date>
    <item>
      <title>Executors getting killed while Scaling Spark jobs on GPU using RAPIDS(NVIDIA)</title>
      <link>https://community.databricks.com/t5/data-engineering/executors-getting-killed-while-scaling-spark-jobs-on-gpu-using/m-p/121862#M46576</link>
      <description>&lt;P&gt;Hi Team ,&amp;nbsp;&lt;/P&gt;&lt;P&gt;I want to take advantage of Spark Distribution over GPU clusters using RAPID(NVIDIA) , everything is setup&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;1. The Jar is loaded correctly via Init script , the jar is downloaded and uploaded on volume (workspace is unity enabled) and via Init script uploaded to databricks jar location&amp;nbsp;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;src="/Volumes/ml_apps_ml_dev/volumes/team-volume-ml_apps_nonprod/rapids-4-spark_2.12-25.04.0.jar"&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;DEST="/databricks/jars/rapids-4-spark_2.12-25.04.0.jar"&lt;BR /&gt;&lt;BR /&gt;cluster that I am using&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="rajanchaturvedi_0-1750067083816.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/17546iF97AD09BB8FA44DD/image-size/medium?v=v2&amp;amp;px=400" role="button" title="rajanchaturvedi_0-1750067083816.png" alt="rajanchaturvedi_0-1750067083816.png" /&gt;&lt;/span&gt;&lt;BR /&gt;&lt;BR /&gt;Spark configuration that I am using&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="rajanchaturvedi_1-1750067171780.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/17547i31BA6A8A82D9E4E7/image-size/medium?v=v2&amp;amp;px=400" role="button" title="rajanchaturvedi_1-1750067171780.png" alt="rajanchaturvedi_1-1750067171780.png" /&gt;&lt;/span&gt;&lt;BR /&gt;&lt;BR /&gt;After all this configuration I can see GPU optimizations kick in Query Execution Plan as below but when I run the spark join like join , the executors are getting killed and the spark job is stuck , kindly please help&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="rajanchaturvedi_2-1750067287042.png" style="width: 400px;"&gt;&lt;img 
src="https://community.databricks.com/t5/image/serverpage/image-id/17548iF83F984193771EC1/image-size/medium?v=v2&amp;amp;px=400" role="button" title="rajanchaturvedi_2-1750067287042.png" alt="rajanchaturvedi_2-1750067287042.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
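The init-script step described above (copy the RAPIDS jar from the Unity Catalog volume into the Databricks jar directory) might look roughly like the sketch below. This is a hypothetical reconstruction using only the src and DEST paths quoted in the post, not the poster's actual script; the failure message is added so a missing jar shows up in the cluster event log rather than failing silently.

```shell
#!/bin/bash
# Hypothetical init-script sketch, assuming the paths given in the post.
set -euo pipefail

SRC="/Volumes/ml_apps_ml_dev/volumes/team-volume-ml_apps_nonprod/rapids-4-spark_2.12-25.04.0.jar"
DEST="/databricks/jars/rapids-4-spark_2.12-25.04.0.jar"

# Fail loudly if the jar is missing so cluster startup logs show the cause.
if [ ! -f "$SRC" ]; then
  echo "RAPIDS jar not found at $SRC"
  exit 1
fi

cp "$SRC" "$DEST"
```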
      <pubDate>Mon, 16 Jun 2025 09:49:47 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/executors-getting-killed-while-scaling-spark-jobs-on-gpu-using/m-p/121862#M46576</guid>
      <dc:creator>rajanchaturvedi</dc:creator>
      <dc:date>2025-06-16T09:49:47Z</dc:date>
    </item>
    <item>
      <title>Re: Executors getting killed while Scaling Spark jobs on GPU using RAPIDS(NVIDIA)</title>
      <link>https://community.databricks.com/t5/data-engineering/executors-getting-killed-while-scaling-spark-jobs-on-gpu-using/m-p/137332#M50725</link>
      <description>&lt;P&gt;hey&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/140212"&gt;@rajanchaturvedi&lt;/a&gt;&amp;nbsp;,&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Executor termination, especially when scaling a GPU-accelerated job, is almost always due to &lt;STRONG&gt;memory over-allocation&lt;/STRONG&gt; (Out Of Memory, or OOM) on the worker nodes, which causes the cluster manager to kill the process. This is exacerbated in GPU environments because two large memory spaces must be managed: the CPU (JVM) heap and the dedicated GPU memory.&lt;/P&gt;
&lt;P&gt;Can you please check and share the cause of the&amp;nbsp;&lt;SPAN&gt;executors being killed (for example, from the executor logs or the Spark UI)? Also, the job being stuck does not sound right. Are you certain nothing is running?&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 03 Nov 2025 02:50:03 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/executors-getting-killed-while-scaling-spark-jobs-on-gpu-using/m-p/137332#M50725</guid>
      <dc:creator>NandiniN</dc:creator>
      <dc:date>2025-11-03T02:50:03Z</dc:date>
    </item>
    <item>
      <title>Re: Executors getting killed while Scaling Spark jobs on GPU using RAPIDS(NVIDIA)</title>
      <link>https://community.databricks.com/t5/data-engineering/executors-getting-killed-while-scaling-spark-jobs-on-gpu-using/m-p/137346#M50726</link>
      <description>&lt;P&gt;Also try to gradually reduce&amp;nbsp;&lt;STRONG&gt;&lt;CODE&gt;spark.executor.memory&lt;/CODE&gt;&lt;/STRONG&gt;&amp;nbsp;You need to allocate less memory to the JVM heap because the GPU needs a large chunk of the node's &lt;I&gt;off-heap&lt;/I&gt; (system) memory. The GPU memory is allocated outside the JVM heap. If the heap is too large, it crowds out the native memory required by RAPIDS/CUDA.&lt;/P&gt;
&lt;P&gt;Again reduce gradually&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;CODE&gt;spark.rapids.memory.gpu.allocFraction&lt;/CODE&gt;&lt;/STRONG&gt;&amp;nbsp;from the default (usually &lt;CODE&gt;0.5&lt;/CODE&gt; or &lt;CODE&gt;0.8&lt;/CODE&gt;). Try &lt;STRONG&gt;&lt;CODE&gt;0.4&lt;/CODE&gt;&lt;/STRONG&gt; or &lt;STRONG&gt;&lt;CODE&gt;0.3&lt;/CODE&gt;&lt;/STRONG&gt;&lt;/P&gt;</description>
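Taken together, the two suggestions above might look like the following starting point. The values "16g" and "0.4" are illustrative assumptions, not tuned recommendations; on Databricks these entries go in the cluster's Spark config box (one "key value" pair per line) rather than in notebook code.

```python
# Hypothetical starting values for the two settings discussed above.
# Both values are illustrative; reduce them gradually for your node type.
rapids_conf = {
    # A smaller JVM heap leaves more off-heap (system) memory for RAPIDS/CUDA.
    "spark.executor.memory": "16g",
    # Fraction of GPU memory the RAPIDS pool claims at startup; lower it if
    # executors die with GPU out-of-memory errors.
    "spark.rapids.memory.gpu.allocFraction": "0.4",
}

# Render the entries in the "key value" form the cluster Spark config expects.
for key, value in rapids_conf.items():
    print(f"{key} {value}")
```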
      <pubDate>Mon, 03 Nov 2025 07:22:01 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/executors-getting-killed-while-scaling-spark-jobs-on-gpu-using/m-p/137346#M50726</guid>
      <dc:creator>NandiniN</dc:creator>
      <dc:date>2025-11-03T07:22:01Z</dc:date>
    </item>
  </channel>
</rss>

