Support FAQs
Find answers to common questions and troubleshoot issues with Databricks support FAQs. Access helpful resources, tips, and solutions to resolve technical challenges and enhance your Databricks experience.
Adam_Pavlacka
Databricks Employee

You should use distributed training.

Distributing the training workload across multiple GPUs or worker nodes makes better use of cluster resources and reduces the likelihood of ConnectionException errors and out-of-memory (OOM) failures.

A good option for distributed training is Horovod, a distributed deep learning framework.

The following resources provide guidance on setting up and using Horovod with Databricks; a minimal HorovodRunner sketch follows the list:

  • HorovodRunner: distributed deep learning with Horovod (AWS | Azure | GCP)
  • Distributed training (AWS | Azure | GCP)
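
As a rough illustration of the pattern those docs describe, here is a minimal sketch of running a PyTorch training function with HorovodRunner. It assumes a Databricks ML runtime where `sparkdl` and `horovod` are available, and uses a hypothetical toy model and random data purely for illustration; replace those with your own model and data loader.

```python
# Minimal sketch: distributing a PyTorch training loop with HorovodRunner.
# Assumes a Databricks ML runtime (sparkdl + horovod preinstalled).
import torch
import torch.nn as nn
import torch.optim as optim
from sparkdl import HorovodRunner


def train_one_epoch():
    import horovod.torch as hvd

    hvd.init()

    # Pin each worker process to a single GPU if one is available.
    if torch.cuda.is_available():
        torch.cuda.set_device(hvd.local_rank())
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Hypothetical toy model and optimizer; replace with your own.
    model = nn.Linear(10, 1).to(device)
    optimizer = optim.SGD(model.parameters(), lr=0.01 * hvd.size())

    # Wrap the optimizer so gradients are averaged across workers.
    optimizer = hvd.DistributedOptimizer(
        optimizer, named_parameters=model.named_parameters()
    )

    # Broadcast initial state so every worker starts from the same weights.
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)

    loss_fn = nn.MSELoss()
    for step in range(100):
        # Random data stands in for a real, sharded dataset.
        inputs = torch.randn(32, 10, device=device)
        targets = torch.randn(32, 1, device=device)
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()


# np=2 requests two worker processes on the cluster;
# a negative np runs locally on the driver, which is handy for debugging.
hr = HorovodRunner(np=2)
hr.run(train_one_epoch)
```

Scaling the learning rate by `hvd.size()` and broadcasting the initial parameters are the standard Horovod conventions for keeping workers in sync; see the linked docs for the full recommended setup, including checkpointing and data sharding.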