04-13-2022 01:24 PM
Hi, I'm trying to use the Databricks platform for PyTorch distributed training, but I didn't find any information about this. What I expected is to run a common job across multiple clusters using PyTorch distributed data parallel (DDP) with the code below:
On device 1: %sh python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=0 --master_addr="127.0.0.1" --master_port=29500 train_something.py
On device 2: %sh python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=1 --master_addr="127.0.0.1" --master_port=29500 train_something.py
This is definitely supported by other compute platforms like Slurm, but it fails on Databricks. Could you let me know whether you support this, or whether you would consider adding this feature in future development? Thank you in advance!
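For context, here is a minimal sketch of what a script like train_something.py typically contains so that the launcher commands above can drive it. The model, data, and backend choice below are placeholders, not the poster's actual script:

```python
# Minimal DDP training script sketch (placeholder model and data).
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun (and torch.distributed.launch with --use_env) exports RANK,
    # WORLD_SIZE, LOCAL_RANK, MASTER_ADDR and MASTER_PORT, which env:// reads.
    dist.init_process_group(backend="gloo", init_method="env://")

    model = torch.nn.Linear(8, 1)   # placeholder model
    ddp_model = DDP(model)          # gradients are synchronized across processes
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)

    inputs = torch.randn(4, 8)      # placeholder batch
    targets = torch.randn(4, 1)
    loss = torch.nn.functional.mse_loss(ddp_model(inputs), targets)
    loss.backward()                 # backward() all-reduces gradients across ranks
    optimizer.step()

    dist.destroy_process_group()
    return loss.item()

if __name__ == "__main__":
    main()
```

Note that with two separate machines, --master_addr would need to be the reachable address of the rank-0 node rather than 127.0.0.1, since loopback cannot coordinate across hosts.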
04-14-2022 07:12 AM
@Shaomu Tan , can you check SparkTorch?
Parallel processing on Databricks clusters is mainly based on Apache Spark™, so to use it, the library in question (PyTorch) has to be written for Spark. SparkTorch is an attempt to do just that.
You can also run Apache Ray or Dask on Databricks (I thought that was possible too), bypassing Apache Spark.
04-26-2022 03:37 AM
Hi @Shaomu Tan , just a friendly follow-up. Do you still need help, or did @Werner Stinckens 's response help you find a solution? Please let us know.
02-19-2023 08:15 AM
With the Databricks Machine Learning Runtime (MLR), HorovodRunner is provided, which supports distributed training and inference with PyTorch. Here's an example notebook for your reference: PyTorchDistributedDeepLearningTraining - Databricks.
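A hedged sketch of the HorovodRunner-with-PyTorch pattern described in that notebook; the model, learning rate, and worker count below are placeholder assumptions, not the notebook's actual contents:

```python
# Sketch of a Horovod training function for use with HorovodRunner
# (placeholder model; real training loop omitted).
def train_hvd():
    import torch
    import horovod.torch as hvd

    hvd.init()
    model = torch.nn.Linear(8, 1)  # placeholder model
    # Scale the learning rate by the number of workers, per Horovod guidance
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())
    # Average gradients across workers on each optimizer step
    optimizer = hvd.DistributedOptimizer(
        optimizer, named_parameters=model.named_parameters())
    # Start all workers from the same initial weights
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    # ... training loop goes here ...
```

On an MLR cluster this function would then be launched with something like `from sparkdl import HorovodRunner; HorovodRunner(np=2).run(train_hvd)`, where `np` is the number of parallel worker processes.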