Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.

Distributed Training quits if any worker node fails

aswinkks
New Contributor III

Hi,

I'm training a PyTorch model in a distributed environment using PyTorch's DistributedDataParallel (DDP) library. I have spun up 10 worker nodes. The issue I'm facing is that if any worker node fails and exits during training, the entire notebook execution fails and I need to start from the beginning.

I understand this is a limitation of DDP, which is not fault tolerant on its own. I even tried saving checkpoints, but nothing seems to work.

Is there any alternative that lets the training continue even if a worker node fails, and still complete successfully?

I'm also open to working with other distributed training libraries if that overcomes this limitation.
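For reference, the kind of checkpoint save/resume logic I tried looks roughly like this (a minimal sketch, not my actual code; the model, optimizer, data loader, and the /dbfs checkpoint path are placeholders):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

CKPT_PATH = "/dbfs/tmp/ddp_checkpoint.pt"  # placeholder path on shared storage

def train(model, optimizer, train_loader, epochs):
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    model = DDP(model.cuda(local_rank), device_ids=[local_rank])

    start_epoch = 0
    # Resume from the last checkpoint if one exists, so a restarted run
    # doesn't begin from scratch.
    if os.path.exists(CKPT_PATH):
        ckpt = torch.load(CKPT_PATH, map_location=f"cuda:{local_rank}")
        model.module.load_state_dict(ckpt["model"])
        optimizer.load_state_dict(ckpt["optimizer"])
        start_epoch = ckpt["epoch"] + 1

    for epoch in range(start_epoch, epochs):
        for batch, target in train_loader:
            batch, target = batch.cuda(local_rank), target.cuda(local_rank)
            optimizer.zero_grad()
            loss = torch.nn.functional.cross_entropy(model(batch), target)
            loss.backward()
            optimizer.step()

        # Only rank 0 writes the checkpoint; all ranks wait at the barrier.
        if dist.get_rank() == 0:
            torch.save(
                {"model": model.module.state_dict(),
                 "optimizer": optimizer.state_dict(),
                 "epoch": epoch},
                CKPT_PATH,
            )
        dist.barrier()
```

With something like this, a restarted run picks up from the last completed epoch instead of from scratch, but the notebook still exits as soon as one rank dies, which is the part I can't get around.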

1 REPLY

rcdatabricks
New Contributor III

Can you provide more info on why the worker nodes are failing? Are you using spot or on-demand instances?
