You should use distributed training.
By distributing the training workload across multiple GPUs or worker nodes, you make better use of the cluster's resources and reduce the likelihood of ConnectionException errors and out-of-memory (OOM) issues.
A good option for distributed training is Horovod, a distributed deep learning framework.
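As a rough illustration, here is a minimal sketch of what a Horovod training job can look like on Databricks, using HorovodRunner from the sparkdl package (included in Databricks ML runtimes) with a placeholder PyTorch model. The model, synthetic data, hyperparameters, and the np=2 worker count are illustrative assumptions, not something prescribed by Horovod or Databricks:

```python
# Minimal sketch: Horovod + PyTorch launched via Databricks HorovodRunner.
# The model, data, and hyperparameters below are placeholders.
import torch
import torch.nn as nn
import torch.optim as optim
import horovod.torch as hvd
from sparkdl import HorovodRunner


def train():
    # Initialize Horovod and pin each process to one GPU if available.
    hvd.init()
    if torch.cuda.is_available():
        torch.cuda.set_device(hvd.local_rank())
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Placeholder model and synthetic data; replace with your own.
    model = nn.Linear(10, 1).to(device)
    features = torch.randn(512, 10)
    labels = torch.randn(512, 1)
    dataset = torch.utils.data.TensorDataset(features, labels)

    # Shard the dataset so each worker trains on a distinct slice.
    sampler = torch.utils.data.distributed.DistributedSampler(
        dataset, num_replicas=hvd.size(), rank=hvd.rank())
    loader = torch.utils.data.DataLoader(dataset, batch_size=32, sampler=sampler)

    # Scale the learning rate by the number of workers and wrap the
    # optimizer so gradients are averaged across workers on each step.
    optimizer = optim.SGD(model.parameters(), lr=0.01 * hvd.size())
    optimizer = hvd.DistributedOptimizer(
        optimizer, named_parameters=model.named_parameters())

    # Start all workers from the same initial model and optimizer state.
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)

    loss_fn = nn.MSELoss()
    for epoch in range(3):
        sampler.set_epoch(epoch)
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
        if hvd.rank() == 0:
            print(f"epoch {epoch}: loss {loss.item():.4f}")


# np=2 requests two parallel worker processes; set it to match your cluster.
hr = HorovodRunner(np=2)
hr.run(train)
```

Because each worker only holds its own shard of the data and its own batch in memory, and gradient exchange is handled by Horovod's allreduce rather than a single driver, this pattern spreads both the memory footprint and the communication load across the cluster.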
The following resources can provide guidance on how to set up and use Horovod with Databricks: