cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Does Databricks supports the Pytorch Distributed Training for multiple devices?

Smu_Tan
New Contributor

Hi, Im trying to use the databricks platform to do the pytorch distributed training, but I didnt find any info about this. What I expected is using multiple clusters to run a common job using pytorch distributed data parallel (DDP) with the code below:

On device 1: %sh python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=0 --master_addr="127.0.0.1" --master_port=29500 train_something.py

On device 2: %sh python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=1 --master_addr="127.0.0.1" --master_port=29500 train_something.py

This is definitely supported by other computation platform like slurm, but it failed in the databricks. Could you let me know whether you do support this? or you will consider to add this feature for the later developments. Thank you in advance!

1 ACCEPTED SOLUTION

Accepted Solutions

-werners-
Esteemed Contributor III

@Shaomu Tan​ , can you check sparktorch?

The parallel processing on Databricks clusters is mainly based on Apache Spark™. So to use the parallel processing, the library in question (PyTorch) has to be written for Spark. spark torch is an attempt to do just that.

You can also run Apache Ray on Databricks or Dask (I thought that was possible too), so bypassing Apache spark

View solution in original post

2 REPLIES 2

-werners-
Esteemed Contributor III

@Shaomu Tan​ , can you check sparktorch?

The parallel processing on Databricks clusters is mainly based on Apache Spark™. So to use the parallel processing, the library in question (PyTorch) has to be written for Spark. spark torch is an attempt to do just that.

You can also run Apache Ray on Databricks or Dask (I thought that was possible too), so bypassing Apache spark

axb0
Databricks Employee
Databricks Employee

With Databricks MLR, HorovodRunner is provided which supports distributed training and inference with PyTorch. Here's an example notebook for your reference: PyTorchDistributedDeepLearningTraining - Databricks.

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group