How do I implement and train a custom PyTorch model on Databricks using distributed training?

Suheb
Contributor

How can I build my own PyTorch machine-learning model and train it faster on Databricks by using multiple machines/GPUs instead of just one?

KaushalVachhani
Databricks Employee

@Suheb, take a look at TorchDistributor. It supports several distributed training setups, including single-node multi-GPU training and multi-node training. See the references below.

https://docs.databricks.com/aws/en/machine-learning/train-model/distributed-training/spark-pytorch-d...

https://docs.databricks.com/aws/en/notebooks/source/deep-learning/torch-distributor-lightning.html
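
To give a rough idea of what this looks like in a notebook, here is a minimal sketch, not a definitive setup. It assumes a cluster runtime where `pyspark.ml.torch.distributor.TorchDistributor` is available and that the processes it launches can initialize `torch.distributed` from environment variables. The toy model, the synthetic data, and the names `train_fn`, `distributor_settings`, and `launch` are hypothetical and only for illustration.

```python
def distributor_settings(num_nodes, gpus_per_node, use_gpu=True):
    """Hypothetical helper: derive TorchDistributor arguments from a cluster shape."""
    return {
        "num_processes": num_nodes * gpus_per_node,  # one process per GPU (assumption)
        "local_mode": num_nodes == 1,                # True = train on the driver node only
        "use_gpu": use_gpu,
    }


def train_fn(learning_rate=0.01, epochs=2):
    """Training loop executed by every process that the distributor starts."""
    # Imports stay inside the function because it is serialized and sent to workers.
    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader, TensorDataset
    from torch.utils.data.distributed import DistributedSampler

    use_gpu = torch.cuda.is_available()
    # Assumes the launcher has set RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT.
    dist.init_process_group(backend="nccl" if use_gpu else "gloo")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    device = torch.device(f"cuda:{local_rank}" if use_gpu else "cpu")
    if use_gpu:
        torch.cuda.set_device(device)

    # Synthetic regression data stands in for a real dataset.
    torch.manual_seed(0)
    features = torch.randn(1024, 10)
    targets = features.sum(dim=1, keepdim=True)
    dataset = TensorDataset(features, targets)
    sampler = DistributedSampler(dataset)  # gives each process its own shard
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    # Replace this toy network with your own nn.Module.
    model = torch.nn.Sequential(
        torch.nn.Linear(10, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1)
    ).to(device)
    ddp_model = DDP(model, device_ids=[local_rank] if use_gpu else None)
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=learning_rate)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(epochs):
        sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = loss_fn(ddp_model(x), y)
            loss.backward()  # DDP averages gradients across processes here
            optimizer.step()

    final_loss = loss.item()
    dist.destroy_process_group()
    return final_loss


def launch(num_nodes=1, gpus_per_node=2):
    """Start distributed training; call this from a notebook cell on the cluster."""
    from pyspark.ml.torch.distributor import TorchDistributor

    distributor = TorchDistributor(**distributor_settings(num_nodes, gpus_per_node))
    # Extra positional arguments are forwarded to train_fn.
    return distributor.run(train_fn, 0.01, 2)
```

Calling `launch(num_nodes=1, gpus_per_node=2)` would correspond to the single-node multi-GPU case, while a larger `num_nodes` switches `local_mode` off so the processes run on the worker nodes. Check the linked documentation for the exact options supported by your runtime.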