Adam_Pavlacka
Databricks Employee
on 01-10-2024 05:00 PM
You should use distributed training.
Distributing the training workload across multiple GPUs or worker nodes improves resource utilization and reduces the likelihood of ConnectionException errors and out-of-memory (OOM) issues.
A good option for distributed training is Horovod, an open-source distributed deep learning framework. On Databricks, Horovod jobs can be launched from a notebook with HorovodRunner, which is included in Databricks Runtime for Machine Learning.
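As a rough illustration, the sketch below shows one way a Horovod training function might be launched with HorovodRunner from a Databricks notebook. The model, data, and worker count (`np=2`) are placeholders chosen for this example, not a definitive implementation; a real job would use your own model and a sharded dataset.

```python
# Minimal sketch: distributed PyTorch training with Horovod, launched via
# HorovodRunner on a Databricks ML Runtime cluster. The model and data are
# toy placeholders; np=2 assumes a cluster with at least two workers.
import torch
import torch.nn as nn
import horovod.torch as hvd
from sparkdl import HorovodRunner


def train(num_epochs=5):
    hvd.init()  # one Horovod process per GPU / worker slot
    if torch.cuda.is_available():
        torch.cuda.set_device(hvd.local_rank())

    model = nn.Linear(10, 1)  # placeholder model
    if torch.cuda.is_available():
        model.cuda()

    # Scale the learning rate by the number of workers (common Horovod practice),
    # then wrap the optimizer so gradients are averaged across workers.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())
    optimizer = hvd.DistributedOptimizer(
        optimizer, named_parameters=model.named_parameters()
    )

    # Start every worker from the same initial state.
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)

    loss_fn = nn.MSELoss()
    for epoch in range(num_epochs):
        # Toy random batch; a real job would give each worker its own
        # shard of the data (e.g. with a DistributedSampler).
        x = torch.randn(32, 10)
        y = torch.randn(32, 1)
        if torch.cuda.is_available():
            x, y = x.cuda(), y.cuda()
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()  # gradients are averaged across workers here
        optimizer.step()
        if hvd.rank() == 0:
            print(f"epoch {epoch}: loss {loss.item():.4f}")


# np=2 runs the function on two worker processes; a negative np value
# runs that many subprocesses locally on the driver for testing.
hr = HorovodRunner(np=2)
hr.run(train, num_epochs=5)
```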
The following resources provide guidance on setting up and using Horovod with Databricks: