You should use distributed training.
By distributing the training workload across multiple GPUs or worker nodes, you make better use of the cluster's resources and reduce the likelihood of ConnectionException errors and out-of-memory (OOM) issues.
A good option for distributed training is Horovod, a distributed deep learning framework.
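As a rough illustration, here is a minimal sketch of what a Horovod training job can look like on Databricks, using HorovodRunner from the sparkdl package (included in Databricks ML runtimes) with a placeholder PyTorch model. The model, synthetic data, hyperparameters, and the np=2 worker count are illustrative assumptions, not something prescribed by Horovod or Databricks:

```python
# Minimal sketch: Horovod + PyTorch launched via Databricks HorovodRunner.
# The model, data, and hyperparameters below are placeholders.
import torch
import torch.nn as nn
import torch.optim as optim
import horovod.torch as hvd
from sparkdl import HorovodRunner


def train():
    # Initialize Horovod and pin each process to one GPU if available.
    hvd.init()
    if torch.cuda.is_available():
        torch.cuda.set_device(hvd.local_rank())
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Placeholder model and synthetic data; replace with your own.
    model = nn.Linear(10, 1).to(device)
    features = torch.randn(512, 10)
    labels = torch.randn(512, 1)
    dataset = torch.utils.data.TensorDataset(features, labels)

    # Shard the dataset so each worker trains on a distinct slice.
    sampler = torch.utils.data.distributed.DistributedSampler(
        dataset, num_replicas=hvd.size(), rank=hvd.rank())
    loader = torch.utils.data.DataLoader(dataset, batch_size=32, sampler=sampler)

    # Scale the learning rate by the number of workers and wrap the
    # optimizer so gradients are averaged across workers on each step.
    optimizer = optim.SGD(model.parameters(), lr=0.01 * hvd.size())
    optimizer = hvd.DistributedOptimizer(
        optimizer, named_parameters=model.named_parameters())

    # Start all workers from the same initial model and optimizer state.
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)

    loss_fn = nn.MSELoss()
    for epoch in range(3):
        sampler.set_epoch(epoch)
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
        if hvd.rank() == 0:
            print(f"epoch {epoch}: loss {loss.item():.4f}")


# np=2 requests two parallel worker processes; set it to match your cluster.
hr = HorovodRunner(np=2)
hr.run(train)
```

Because each worker only holds its own shard of the data and its own batch in memory, and gradient exchange is handled by Horovod's allreduce rather than a single driver, this pattern spreads both the memory footprint and the communication load across the cluster.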
The following resources can provide guidance on how to set up and use Horovod with Databricks: