Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.

Distributed Training quits if any worker node fails

aswinkks
New Contributor III

Hi,

I'm training a PyTorch model in a distributed environment using PyTorch's DistributedDataParallel (DDP) library, and I have spun up 10 worker nodes. The issue I'm facing is that if any worker node fails and exits during training, the entire notebook execution fails and I have to start again from the beginning.

I understand this is a limitation of DDP itself, which is not fault tolerant. I even tried saving checkpoints, but nothing seems to work.
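
A simplified sketch of the kind of checkpoint-and-resume loop I have in mind, so that a restarted run can pick up from the last completed epoch instead of from scratch (build_model, train_one_epoch, num_epochs and the checkpoint path are placeholders for my actual code):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

CHECKPOINT_PATH = "/dbfs/tmp/ddp_checkpoint.pt"  # shared storage visible to all nodes

dist.init_process_group(backend="nccl")          # launcher supplies RANK/WORLD_SIZE/MASTER_ADDR
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

model = build_model().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Resume from the latest checkpoint if one exists (load before wrapping in DDP).
start_epoch = 0
if os.path.exists(CHECKPOINT_PATH):
    ckpt = torch.load(CHECKPOINT_PATH, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    start_epoch = ckpt["epoch"] + 1

ddp_model = DDP(model, device_ids=[local_rank])

for epoch in range(start_epoch, num_epochs):
    train_one_epoch(ddp_model, optimizer)        # existing training step
    if dist.get_rank() == 0:                     # only rank 0 writes the checkpoint
        torch.save({"model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "epoch": epoch}, CHECKPOINT_PATH)
    dist.barrier()                               # ensure the file is written before the next epoch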

Is there any alternative that would let the training continue even if a worker node fails, and still complete successfully?

I'm even open to working with other distributed training libraries if this limitation can be overcome.
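
One alternative I've been reading about is PyTorch's elastic launcher (torchrun), which, as I understand it, can relaunch the training processes after a worker dies and have them rejoin, as long as the training script resumes from a checkpoint like the sketch above. Something along these lines, where the host and port are placeholders and which I have not yet verified on Databricks:

torchrun --nnodes=1:10 --nproc_per_node=1 --max-restarts=3 \
    --rdzv_backend=c10d --rdzv_endpoint=<driver-host>:29400 train.py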

1 REPLY

rcdatabricks
New Contributor III

Can you provide more information on why the worker nodes are failing? Are you using spot or on-demand instances?