Executors getting killed while scaling Spark jobs on GPU using RAPIDS (NVIDIA)

rajanchaturvedi
New Contributor

Hi Team,

I want to take advantage of Spark's distributed processing on GPU clusters using RAPIDS (NVIDIA). Everything is set up as follows:

1. The JAR is loaded correctly via an init script. I downloaded the JAR and uploaded it to a volume (the workspace is Unity Catalog enabled), and the init script copies it to the Databricks JAR location:


src="/Volumes/ml_apps_ml_dev/volumes/team-volume-ml_apps_nonprod/rapids-4-spark_2.12-25.04.0.jar"


DEST="/databricks/jars/rapids-4-spark_2.12-25.04.0.jar"
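
For reference, a minimal sketch of what such an init script might look like, assuming the two paths above are its only inputs. The helper name `copy_jar` is hypothetical and not taken from the actual script. The sketch logs a failure instead of aborting, which is an assumption about the desired behaviour.

```shell
#!/bin/bash
# Sketch of a cluster init script that copies the RAPIDS JAR from a
# Unity Catalog volume into the Databricks JAR directory.
# The two paths are the ones quoted in the post; copy_jar is a
# hypothetical helper introduced only for this illustration.

src="/Volumes/ml_apps_ml_dev/volumes/team-volume-ml_apps_nonprod/rapids-4-spark_2.12-25.04.0.jar"
DEST="/databricks/jars/rapids-4-spark_2.12-25.04.0.jar"

copy_jar() {
  local from="$1" to="$2"
  # Refuse to continue if the source JAR is not visible on this node.
  if [ ! -f "$from" ]; then
    echo "RAPIDS jar not found at $from" >&2
    return 1
  fi
  # Copy the JAR and list it so the init script log shows its size.
  cp "$from" "$to" && ls -l "$to"
}

# Log a message on failure rather than exiting (assumed behaviour).
copy_jar "$src" "$DEST" || echo "init script: RAPIDS jar copy failed" >&2
```

If a missing JAR should stop the cluster from starting, the last line could end with `exit 1` instead of only logging.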

The cluster that I am using:

rajanchaturvedi_0-1750067083816.png

The Spark configuration that I am using:

rajanchaturvedi_1-1750067171780.png

After all this configuration, I can see the GPU optimizations kick in in the query execution plan, as shown below. However, when I run a Spark operation such as a join, the executors get killed and the Spark job gets stuck. Please help.

rajanchaturvedi_2-1750067287042.png