Hi,
I'm training a PyTorch model in a distributed environment using PyTorch's DistributedDataParallel (DDP) library, and I have spun up 10 worker nodes. The issue I'm facing is that if any worker node fails and exits during training, the entire notebook execution fails and I need to start again from the beginning.
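For reference, this is a simplified sketch of how I set up DDP (the model, data, and hyperparameters here are just toy placeholders, not my real code; each worker is launched with torchrun):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for every worker process
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # toy model and data, just to show the structure of the training loop
    model = DDP(torch.nn.Linear(10, 2).cuda(local_rank), device_ids=[local_rank])
    dataset = TensorDataset(torch.randn(1024, 10), torch.randint(0, 2, (1024,)))
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, sampler=sampler, batch_size=32)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for epoch in range(5):
        sampler.set_epoch(epoch)  # reshuffle data across ranks each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            loss = torch.nn.functional.cross_entropy(model(x), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```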
I understand this is a limitation of DDP, which is not fault tolerant on its own. I even tried saving checkpoints, but nothing seems to work.
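This is roughly how I'm saving and restoring checkpoints, writing from rank 0 only (the path and dictionary keys are just examples, not my actual names):

```python
import os
import torch
import torch.distributed as dist

CKPT_PATH = "checkpoint.pt"  # example path

def save_checkpoint(model, optimizer, epoch):
    # only rank 0 writes the file, then everyone waits at the barrier
    if dist.get_rank() == 0:
        torch.save({
            "epoch": epoch,
            "model": model.module.state_dict(),   # unwrap the DDP module
            "optimizer": optimizer.state_dict(),
        }, CKPT_PATH)
    dist.barrier()

def load_checkpoint(model, optimizer, device):
    # every rank loads the same checkpoint if one exists
    start_epoch = 0
    if os.path.exists(CKPT_PATH):
        ckpt = torch.load(CKPT_PATH, map_location=device)
        model.module.load_state_dict(ckpt["model"])
        optimizer.load_state_dict(ckpt["optimizer"])
        start_epoch = ckpt["epoch"] + 1
    return start_epoch
```

Even with this in place, the job still dies as soon as one node goes away, so the checkpoints only help me restart manually.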
Is there any alternative that lets the training run continue and complete successfully even if a worker node fails?
I'm also open to working with other distributed training libraries if they can overcome this limitation.