Adam_Pavlacka
Databricks Employee

A likely root cause is high GPU utilization, in particular GPU memory pressure, while a live experiment is running. You can check this both in the Spark UI and with the nvidia-smi command.
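To make the nvidia-smi check repeatable, here is a minimal Python sketch that runs it and reports memory use per GPU. It assumes nvidia-smi is on the PATH of the node where the code runs (typically the driver or a GPU worker). The helper names parse_gpu_stats and query_gpus are hypothetical, not part of any Databricks API.

```python
import shutil
import subprocess

# Fields requested from nvidia-smi; memory values are reported in MiB.
QUERY_FIELDS = "index,memory.used,memory.total,utilization.gpu"


def parse_gpu_stats(csv_text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    stats = []
    for line in csv_text.strip().splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != 4:
            continue  # skip lines that do not match the requested fields
        try:
            index, used_mib, total_mib, util_pct = (int(f) for f in fields)
        except ValueError:
            continue  # skip GPUs that report a non-numeric value such as "[N/A]"
        stats.append({
            "index": index,
            "memory_used_mib": used_mib,
            "memory_total_mib": total_mib,
            "memory_fraction": used_mib / total_mib if total_mib else 0.0,
            "utilization_pct": util_pct,
        })
    return stats


def query_gpus():
    """Run nvidia-smi once; return an empty list if it is not installed."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_stats(result.stdout)


if __name__ == "__main__":
    for gpu in query_gpus():
        print(f"GPU {gpu['index']}: "
              f"{gpu['memory_used_mib']}/{gpu['memory_total_mib']} MiB used, "
              f"{gpu['utilization_pct']}% utilization")
```

If memory_fraction stays close to 1.0 on one GPU while the experiment runs, that supports the OOM explanation.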

If the workload is explicitly pinned to a single GPU, that GPU can become overloaded, which leads to out-of-memory (OOM) errors.

To avoid OOM issues, consider the following suggestions:

  • Experiment with smaller batch sizes
  • Use a larger GPU
  • Use a distributed training framework such as Horovod to spread the work across multiple GPUs
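As an illustration of the first suggestion, the sketch below halves the batch size until one training step fits in memory. It is framework-agnostic and makes assumptions: find_fitting_batch_size and try_step are hypothetical names, and the exception that signals OOM depends on your framework (with recent PyTorch versions you would likely pass torch.cuda.OutOfMemoryError).

```python
def find_fitting_batch_size(try_step, start_batch_size, min_batch_size=1,
                            oom_exceptions=(MemoryError,)):
    """Halve the batch size until try_step(batch_size) runs without OOM.

    try_step is a caller-supplied function (hypothetical) that runs one
    training step at the given batch size and raises one of
    oom_exceptions when the GPU runs out of memory.
    """
    batch_size = start_batch_size
    while batch_size >= min_batch_size:
        try:
            try_step(batch_size)
            return batch_size  # this size fit in memory
        except oom_exceptions:
            batch_size //= 2   # too large: retry with half the batch
    raise RuntimeError(
        f"No batch size >= {min_batch_size} fit in memory")


if __name__ == "__main__":
    # Stand-in for a real training step: pretend anything above 40 overflows.
    def fake_step(batch_size):
        if batch_size > 40:
            raise MemoryError("simulated GPU OOM")

    print(find_fitting_batch_size(fake_step, start_batch_size=256))
```

A smaller batch changes the training dynamics, so you may need to adjust the learning rate accordingly.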