Distributed training with PyTorch’s DistributedDataParallel (DDP) is not inherently fault-tolerant: if any node fails, the whole job crashes, and, as you noted, checkpointing alone cannot auto-recover the process without a supervisor or application-level orchestration layer. However, there are alternative frameworks and design patterns you can adopt to build a more resilient distributed training workflow.
Options for Fault-Tolerant Distributed Training
1. Horovod with Spark
Horovod, originally built for TensorFlow and Keras but with full PyTorch support, integrates tightly with Spark. When launched through HorovodRunner (or horovod.spark.run), a failed task can be retried under Spark's job-failure policies. This gives limited fault tolerance: executors are restarted up to a threshold, but catastrophic multi-node loss or non-transient failures still force a full restart.
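A minimal sketch of what this looks like, assuming Horovod's `horovod.spark.run` launcher and an active SparkSession (HorovodRunner on Databricks wraps a function the same way); `build_model()` and `build_dataset()` are hypothetical helpers standing in for your own code:

```python
# Sketch: a Horovod PyTorch training function launched as Spark tasks.
# Assumes an active SparkSession; build_model()/build_dataset() are placeholders.
import torch
import horovod.torch as hvd
import horovod.spark


def train(num_epochs=5):
    hvd.init()                                   # one Horovod process per Spark task
    torch.manual_seed(42)

    model = build_model()                        # hypothetical helper
    dataset = build_dataset()                    # hypothetical helper
    sampler = torch.utils.data.distributed.DistributedSampler(
        dataset, num_replicas=hvd.size(), rank=hvd.rank())
    loader = torch.utils.data.DataLoader(dataset, batch_size=64, sampler=sampler)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())
    optimizer = hvd.DistributedOptimizer(
        optimizer, named_parameters=model.named_parameters())

    # Start every worker from identical weights and optimizer state.
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)

    loss_fn = torch.nn.CrossEntropyLoss()
    for epoch in range(num_epochs):
        sampler.set_epoch(epoch)
        for features, labels in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(features), labels)
            loss.backward()
            optimizer.step()
    return model.state_dict()


if __name__ == "__main__":
    # Launches `train` in 4 Horovod processes on Spark; Spark's retry
    # policies apply to the underlying barrier tasks.
    horovod.spark.run(train, args=(5,), num_proc=4)
```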
2. TorchElastic/TorchRun (now Torch Distributed Elastic)
TorchElastic (now Torch Distributed Elastic, launched via torchrun) is designed specifically for elastic, fault-tolerant distributed training with PyTorch. It allows worker nodes to scale up or down during training and restarts failed workers automatically. You can combine it with job orchestrators like Kubernetes or YARN for robust recovery and auto-scaling. It does require your training code to be "stateless enough" to resume from checkpoints.
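The key pattern is a training script that always tries to resume from the latest checkpoint on startup, because every restart (elastic rescale or failure recovery) relaunches the script. A minimal sketch, assuming a shared checkpoint path and hypothetical `build_model()`/`train_one_epoch()` helpers:

```python
# Sketch of an elastic-friendly script, launched for example with:
#   torchrun --nnodes=1:4 --nproc_per_node=8 --max_restarts=3 \
#            --rdzv_backend=c10d --rdzv_endpoint=$HOST:29400 train.py
# The checkpoint path and helpers below are illustrative assumptions.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

CKPT = "/shared/checkpoints/latest.pt"   # must live on storage all nodes can reach


def main():
    dist.init_process_group(backend="gloo")       # use "nccl" for GPU training
    rank = dist.get_rank()

    model = DDP(build_model())                    # hypothetical helper
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    # On every (re)start, resume from the last checkpoint if one exists.
    start_epoch = 0
    if os.path.exists(CKPT):
        state = torch.load(CKPT, map_location="cpu")
        model.module.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        start_epoch = state["epoch"] + 1

    for epoch in range(start_epoch, 100):
        train_one_epoch(model, optimizer, epoch)  # hypothetical helper
        if rank == 0:                             # checkpoint from a single rank
            torch.save({"model": model.module.state_dict(),
                        "optimizer": optimizer.state_dict(),
                        "epoch": epoch}, CKPT)
        dist.barrier()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```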
3. Ray Train / Ray AIR
Ray is a distributed computing framework that supports fault-tolerant training for PyTorch. With Ray Train (part of Ray AIR), failed training workers are rescheduled automatically and training resumes from the latest checkpoint rather than starting over, provided you have checkpointing configured correctly. Ray's actor model and robust orchestration give you more resilience than pure DDP.
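A minimal sketch in the Ray 2.x style, where the training loop reports checkpoints to Ray and reloads them after a worker restart; `build_model()`, `train_one_epoch()`, and the checkpoint file name are assumptions for illustration:

```python
# Sketch: fault-tolerant PyTorch training with Ray Train (Ray 2.x-style API).
import os
import tempfile
import torch
from ray import train
from ray.train import Checkpoint, FailureConfig, RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer, prepare_model


def train_loop_per_worker(config):
    model = prepare_model(build_model())            # hypothetical helper
    optimizer = torch.optim.Adam(model.parameters(), lr=config["lr"])

    # After a worker failure, Ray restores the last reported checkpoint.
    start_epoch = 0
    ckpt = train.get_checkpoint()
    if ckpt:
        with ckpt.as_directory() as d:
            state = torch.load(os.path.join(d, "state.pt"), map_location="cpu")
            model.load_state_dict(state["model"])
            optimizer.load_state_dict(state["optimizer"])
            start_epoch = state["epoch"] + 1

    for epoch in range(start_epoch, config["epochs"]):
        loss = train_one_epoch(model, optimizer)     # hypothetical helper
        with tempfile.TemporaryDirectory() as tmp:
            torch.save({"model": model.state_dict(),
                        "optimizer": optimizer.state_dict(),
                        "epoch": epoch},
                       os.path.join(tmp, "state.pt"))
            # Ray persists this checkpoint and hands it back on restart.
            train.report({"loss": loss}, checkpoint=Checkpoint.from_directory(tmp))


trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"lr": 1e-3, "epochs": 20},
    scaling_config=ScalingConfig(num_workers=4, use_gpu=False),
    run_config=RunConfig(failure_config=FailureConfig(max_failures=3)),
)
result = trainer.fit()
```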
4. Custom Job Orchestration with Checkpointing
- Use an external scheduler (Kubernetes, SLURM, YARN) to manage and relaunch jobs.
- Save model, optimizer, and data loader state regularly on robust storage.
- On failure, relaunch training from the latest checkpoint.
- This gives partial recovery but is more complex and less "seamless" than Ray or TorchElastic; a minimal checkpoint helper is sketched below.
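The sketch below shows the checkpoint half of this pattern in plain PyTorch; the scheduler simply re-runs the script, and all resume logic lives in user code. The path and atomic-rename convention are illustrative assumptions, not a fixed API:

```python
# Sketch: save/resume helpers for an externally relaunched training job.
import os
import torch

CKPT = "/mnt/durable/ckpt.pt"       # durable storage visible to every relaunch


def save_checkpoint(model, optimizer, epoch, path=CKPT):
    tmp = path + ".tmp"
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "epoch": epoch}, tmp)
    os.replace(tmp, path)           # atomic rename: never leaves a half-written file


def load_checkpoint(model, optimizer, path=CKPT):
    if not os.path.exists(path):
        return 0                    # fresh start
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["epoch"] + 1       # resume at the next epoch
```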
Table: Distributed Libraries with Fault Tolerance
| Library/Framework | Fault Tolerant? | Autoscaling/Elasticity | Easy Spark Integration | Notes |
|---|---|---|---|---|
| PyTorch DDP | No | No | No | Any node failure kills the job |
| Horovod | Limited | No | Yes | Spark retries some failures |
| TorchElastic (Distributed Elastic) | Yes | Yes | Indirect | Best for Kubernetes/YARN environments |
| Ray Train/AIR | Yes | Yes | Yes | Checkpointing and auto-restart supported |
Recommendations
- For maximum Spark integration and some retries: HorovodRunner on Spark. This may suffice for small clusters with rare failures.
- For fully robust, elastic, production-grade training: Torch Distributed Elastic (torchrun) with Kubernetes or YARN.
- For modern, Pythonic, and highly fault-tolerant distributed ML: Ray Train/AIR, which supports PyTorch and works alongside Spark environments.
Each of these requires code and infra changes, but they provide the continuity and reliability missing in standard DDP setups.