on 01-10-2024 05:00 PM
A likely root cause is high GPU utilization while the live experiment is running. You can check this in the Spark UI and with the nvidia-smi command.
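As a rough way to run the nvidia-smi check from a notebook cell, the sketch below queries per-GPU utilization and memory through nvidia-smi's CSV query mode. The function names and dictionary keys are illustrative and not part of any Databricks or NVIDIA API. It assumes nvidia-smi is on the PATH of the node where the cell runs (typically the driver) and that every queried field comes back numeric.

```python
import shutil
import subprocess

# Fields requested from nvidia-smi; memory values are reported in MiB.
QUERY_FIELDS = "index,utilization.gpu,memory.used,memory.total"


def parse_gpu_stats(csv_text):
    """Parse 'index, util, used, total' CSV lines into a list of dicts."""
    stats = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        index, util, used, total = [field.strip() for field in line.split(",")]
        stats.append({
            "index": int(index),
            "util_pct": float(util),
            "mem_used_mib": float(used),
            "mem_total_mib": float(total),
            "mem_used_frac": float(used) / float(total),
        })
    return stats


def query_gpu_stats():
    """Run nvidia-smi and return parsed stats, or [] if it is not installed."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        [
            "nvidia-smi",
            f"--query-gpu={QUERY_FIELDS}",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_stats(result.stdout)


if __name__ == "__main__":
    for gpu in query_gpu_stats():
        print(
            f"GPU {gpu['index']}: {gpu['util_pct']:.0f}% util, "
            f"{gpu['mem_used_mib']:.0f}/{gpu['mem_total_mib']:.0f} MiB "
            f"({gpu['mem_used_frac']:.0%} of memory)"
        )
```

If one GPU shows memory usage close to its total while the others sit idle, that is consistent with the work landing on a single device.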
If the workload is explicitly pinned to a single GPU, that GPU can become overloaded, which leads to out-of-memory (OOM) errors.
To avoid OOM issues, consider the following suggestions: