
Distributed Training quits if any worker node fails

aswinkks
New Contributor III

Hi,

I'm training a PyTorch model in a distributed environment using PyTorch's DistributedDataParallel (DDP) library. I have spun up 10 worker nodes. The issue I'm facing is that during training, if any worker node fails and exits, the entire notebook execution fails and I need to start from the beginning.

I understand that this is a limitation of DDP, which is not fault tolerant on its own. I even tried saving checkpoints, but nothing seems to work.

Is there any alternative that would let the training continue and still complete successfully even if a worker node fails?

I'm also open to working with other distributed libraries if that overcomes this limitation.

2 REPLIES

rcdatabricks
New Contributor III

Can you provide more info on why the worker nodes are failing? Are you using spot or on-demand instances?

mark_ott
Databricks Employee

Distributed training with PyTorch's DistributedDataParallel (DDP) is not inherently fault-tolerant: if any node fails, the whole job crashes, and, as you noted, checkpointing alone cannot auto-recover the process without supervisor- or application-level orchestration. However, there are alternative frameworks and design patterns you can adopt to build a more resilient distributed training workflow.

Options for Fault-Tolerant Distributed Training

1. Horovod with Spark

Horovod, originally built for TensorFlow and Keras but with solid PyTorch support, integrates tightly with Spark. Using HorovodRunner on Spark, if a task fails, Spark can retry it according to its job-failure policies. This gives limited fault tolerance: executors are restarted up to a threshold, but catastrophic multi-node loss or non-transient failures can still force a full restart.
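
For reference, here is a minimal sketch of what this can look like on Databricks, assuming an ML runtime where sparkdl.HorovodRunner is available; the toy model, loop, and checkpoint path are placeholders rather than a prescribed setup:

```python
import torch
import torch.nn as nn
import horovod.torch as hvd
from sparkdl import HorovodRunner  # Databricks ML runtime

def train_hvd():
    hvd.init()                                  # one Horovod worker per Spark task
    if torch.cuda.is_available():
        torch.cuda.set_device(hvd.local_rank())

    model = nn.Linear(10, 1)                    # toy model; replace with your own
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

    # Average gradients across workers and start everyone from the same weights.
    optimizer = hvd.DistributedOptimizer(optimizer,
                                         named_parameters=model.named_parameters())
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)

    loss_fn = nn.MSELoss()
    for epoch in range(5):
        x, y = torch.randn(64, 10), torch.randn(64, 1)   # stand-in for a real DataLoader
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
        if hvd.rank() == 0:                     # rank 0 checkpoints to durable storage
            torch.save({"model": model.state_dict(),
                        "optimizer": optimizer.state_dict(),
                        "epoch": epoch}, "/dbfs/checkpoints/horovod_model.pt")

# np=10 matches your 10 workers; Spark's retry policy governs task restarts.
HorovodRunner(np=10).run(train_hvd)
```

Note that the retry behavior here comes from Spark rather than Horovod itself, so a persistent (non-transient) failure still surfaces as a failed job.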

2. TorchElastic/TorchRun (now Torch Distributed Elastic)

TorchElastic (recently renamed Torch Distributed Elastic) is designed specifically for elastic and fault-tolerant distributed training with PyTorch. It allows worker nodes to scale up/down during training and can restart failed workers dynamically. You can combine it with job orchestrators like Kubernetes or YARN for robust recovery and auto-scaling. It requires your training code to be "stateless enough" to resume from checkpoints.
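
A rough sketch of the pattern, assuming a train.py launched with torchrun and a checkpoint path on durable storage; the toy model and restart limits are illustrative:

```python
# Launch from the driver/launcher, e.g.:
#   torchrun --nnodes=1:10 --nproc_per_node=1 --max_restarts=3 \
#            --rdzv_backend=c10d --rdzv_endpoint=<host>:29400 train.py
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

CKPT = "/dbfs/checkpoints/elastic.pt"   # durable storage so restarts can resume

def main():
    dist.init_process_group("gloo")     # use "nccl" on GPU clusters
    model = DDP(nn.Linear(10, 1))       # toy model; replace with your own
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    start_epoch = 0
    if os.path.exists(CKPT):            # on an elastic restart, resume instead of starting over
        state = torch.load(CKPT, map_location="cpu")
        model.module.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        start_epoch = state["epoch"] + 1

    loss_fn = nn.MSELoss()
    for epoch in range(start_epoch, 20):
        x, y = torch.randn(64, 10), torch.randn(64, 1)   # stand-in for a real DataLoader
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
        if dist.get_rank() == 0:        # rank 0 checkpoints every epoch
            torch.save({"model": model.module.state_dict(),
                        "optimizer": optimizer.state_dict(),
                        "epoch": epoch}, CKPT)
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

With --max_restarts set, torchrun re-forms the worker group after a failure and re-runs main(); the checkpoint check at the top is what turns that restart into a resume rather than a from-scratch run.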

3. Ray Train / Ray AIR

Ray is a distributed computing framework that supports fault-tolerant training for PyTorch. With Ray Train or Ray AIR libraries, failed training workers can be rescheduled automatically, and training can continue without starting over, provided you have checkpointing configured correctly. Ray’s actor model and robust orchestration give you more resilience than pure DDP.
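
A minimal sketch with Ray Train's TorchTrainer (Ray 2.x; exact import paths vary a bit between versions). FailureConfig(max_failures=...) tells Ray how many worker failures to tolerate before giving up; the toy model and storage path are illustrative:

```python
import os
import tempfile
import torch
import torch.nn as nn
from ray import train
from ray.train import Checkpoint, FailureConfig, RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer, prepare_model

def train_loop_per_worker(config):
    model = prepare_model(nn.Linear(10, 1))     # wraps the toy model in DDP for you
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    start_epoch = 0
    ckpt = train.get_checkpoint()               # non-None after a worker is rescheduled
    if ckpt:
        with ckpt.as_directory() as d:
            state = torch.load(os.path.join(d, "state.pt"), map_location="cpu")
            model.load_state_dict(state["model"])
            optimizer.load_state_dict(state["optimizer"])
            start_epoch = state["epoch"] + 1

    loss_fn = nn.MSELoss()
    for epoch in range(start_epoch, 20):
        x, y = torch.randn(64, 10), torch.randn(64, 1)   # stand-in for a real DataLoader
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()

        # Report a checkpoint so Ray can resume a rescheduled worker mid-run.
        if train.get_context().get_world_rank() == 0:
            with tempfile.TemporaryDirectory() as d:
                torch.save({"model": model.state_dict(),
                            "optimizer": optimizer.state_dict(),
                            "epoch": epoch}, os.path.join(d, "state.pt"))
                train.report({"epoch": epoch}, checkpoint=Checkpoint.from_directory(d))
        else:
            train.report({"epoch": epoch})

trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=10, use_gpu=False),
    run_config=RunConfig(storage_path="/dbfs/ray_results",          # durable checkpoint location
                         failure_config=FailureConfig(max_failures=3)),
)
result = trainer.fit()
```

Because the checkpoint is handed to Ray via report(), Ray gives it back through get_checkpoint() on the rescheduled worker, so the run continues from the last reported epoch instead of restarting.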

4. Custom Job Orchestration with Checkpointing

  • Use an external scheduler (Kubernetes, SLURM, YARN) to manage jobs.

  • Save model, optimizer, and data loader state regularly on robust storage.

  • On failure, relaunch training from the latest checkpoint (see the sketch after this list).

  • This gives partial recovery but is more complex and less "seamless" than Ray or TorchElastic.
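
As a sketch of the checkpointing piece only (the directory layout and the write-then-rename convention are assumptions, not a Databricks API):

```python
import glob
import os
import torch

CKPT_DIR = "/dbfs/checkpoints/my_job"   # any durable store mounted on every node

def save_checkpoint(model, optimizer, epoch):
    """Write atomically: save to a temp file, then rename, so a crash
    mid-write never leaves a corrupt checkpoint behind."""
    os.makedirs(CKPT_DIR, exist_ok=True)
    tmp = os.path.join(CKPT_DIR, f".epoch_{epoch:05d}.pt.tmp")
    final = os.path.join(CKPT_DIR, f"epoch_{epoch:05d}.pt")
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "epoch": epoch}, tmp)
    os.replace(tmp, final)

def load_latest_checkpoint(model, optimizer):
    """Load the newest checkpoint, if any, and return the next epoch to run."""
    ckpts = sorted(glob.glob(os.path.join(CKPT_DIR, "epoch_*.pt")))
    if not ckpts:
        return 0
    state = torch.load(ckpts[-1], map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["epoch"] + 1
```

The external scheduler then simply re-submits the same job; the first thing the training script does is call load_latest_checkpoint and continue from wherever it left off.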

Table: Distributed Libraries with Fault Tolerance

Library/Framework                   Fault Tolerant?  Autoscaling/Elasticity  Easy Spark Integration  Notes
PyTorch DDP                         No               No                      No                      Any node failure kills the job
Horovod                             Limited          No                      Yes                     Spark retries some failures
TorchElastic (Distributed Elastic)  Yes              Yes                     Indirect                Best for Kubernetes/YARN environments
Ray Train/AIR                       Yes              Yes                     Yes                     Checkpointing and auto-restart supported

Recommendations

  • For maximum Spark integration and some retries: HorovodRunner on Spark – may suffice for small clusters with rare failures.

  • For fully robust, elastic, production-grade training: Torch Distributed Elastic (TorchRun) with Kubernetes or YARN.

  • For modern, Pythonic, and highly fault-tolerant distributed ML: Ray Train/AIR – supports PyTorch and Spark environments.

Each of these requires code and infra changes, but they provide the continuity and reliability missing in standard DDP setups.