Distributed Training quits if any worker node fails

aswinkks — Wed, 28 May 2025 08:13:27 GMT

Hi,

I'm training a Pytorch model in a distributed environment using the Pytorch's DistributedDataParallel (DDP) library. I have spin up 10 worker nodes.The issue which I'm facing is that during the training, if any worker node fails and exits, the entire notebook execution fails, and i need to start from begining.

I understand that it is a limitation from the DDP for not being able to be fault tolerant. I even tried saving checkpoints, but nothing seems to work.

Is there any alternative where I can continue the training execution even if any worker node fails, and still complete successfully.

I'm even open to work on any other distributed libraries, if this limitation can be overcome.

Re: Distributed Training quits if any worker node fails

rcdatabricks — Wed, 04 Jun 2025 19:54:30 GMT

Can you provide more info on why the worker nodes are failing? are you using spot or on-demand instances?

Re: Distributed Training quits if any worker node fails

mark_ott — Thu, 13 Nov 2025 11:37:55 GMT

Distributed training with PyTorch’s DistributedDataParallel (DDP) is not inherently fault-tolerant—if any node fails, the whole job crashes, and, as you noted, checkpointing cannot auto-recover the process without hypervisor or application-level orchestration. However, there are alternative frameworks and design patterns you can adopt to build a more resilient distributed training workflow.

Options for Fault-Tolerant Distributed Training

1. Horovod with Spark

Horovod, originally built for TensorFlow and Keras but well supported for PyTorch, provides tighter Spark integration. Using HorovodRunner in Spark, if a task fails, Spark can retry it based on its job-failure policies. This allows limited fault tolerance, as executors are restarted up to a threshold, but catastrophic multi-node loss or non-transient failures can still force a full restart.

2. TorchElastic/TorchRun (now Torch Distributed Elastic)

TorchElastic (recently renamed Torch Distributed Elastic) is designed specifically for elastic and fault-tolerant distributed training with PyTorch. It allows worker nodes to scale up/down during training and can restart failed workers dynamically. You can combine it with job orchestrators like Kubernetes or YARN for robust recovery and auto-scaling. It requires your training code to be "stateless enough" to resume from checkpoints.

3. Ray Train / Ray AIR

Ray is a distributed computing framework that supports fault-tolerant training for PyTorch. With Ray Train or Ray AIR libraries, failed training workers can be rescheduled automatically, and training can continue without starting over, provided you have checkpointing configured correctly. Ray’s actor model and robust orchestration give you more resilience than pure DDP.

4. Custom Job Orchestration with Checkpointing

Use an external scheduler (Kubernetes, SLURM, YARN) to manage jobs.
Save model, optimizer, and data loader state regularly on robust storage.
On failure, relaunch training from the latest checkpoint.
This gives partial recovery but is more complex and less "seamless" than Ray or TorchElastic.

Table: Distributed Libraries with Fault Tolerance

Library/Framework	Fault Tolerant?	Autoscaling/Elasticity	Easy Spark Integration	Notes
PyTorch DDP	No	No	No	Any node failure kills the job
Horovod	Limited	No	Yes	Spark retries some failures
TorchElastic (Distributed Elastic)	Yes	Yes	Indirect	Best for Kubernetes/YARN environments
Ray Train/AIR	Yes	Yes	Yes	Checkpointing and auto-restart supported

Recommendations

For maximum Spark integration and some retries: HorovodRunner on Spark – may suffice for small clusters with rare failures.
For fully robust, elastic, production-grade training: Torch Distributed Elastic (TorchRun) with Kubernetes or YARN.
For modern, Pythonic, and highly fault-tolerant distributed ML: Ray Train/AIR – supports PyTorch and Spark environments.

Each of these requires code and infra changes, but they provide the continuity and reliability missing in standard DDP setups.

topic Distributed Training quits if any worker node fails in Machine Learning

Distributed Training quits if any worker node fails

Re: Distributed Training quits if any worker node fails

Re: Distributed Training quits if any worker node fails

Options for Fault-Tolerant Distributed Training

1. Horovod with Spark

2. TorchElastic/TorchRun (now Torch Distributed Elastic)

3. Ray Train / Ray AIR

4. Custom Job Orchestration with Checkpointing

Table: Distributed Libraries with Fault Tolerance

Recommendations