Help Needed: Executor Lost Error in Multi-Node Distributed Training with PyTorch
Hi everyone, I'm currently working on distributed training of a PyTorch model, following the example provided here. The training runs perfectly on a single node with a single GPU. However, when I attempt multi-node training using the following configu...
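We do not recommend using spot instances with distributed ML training workloads that use barrier mode, such as TorchDistributor, because these workloads are extremely sensitive to executor loss. In barrier mode all tasks must run at the same time, so losing a single executor when a spot instance is reclaimed fails the whole stage. Please disable spot/pre-emption and try again.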
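If the cluster is defined through a JSON spec (for example for the Clusters API or a job definition), disabling spot can be sketched roughly as below. This is only an illustration: the original post does not say which cloud is in use, so the helper covers AWS, Azure, and GCP, and the field names and values (`aws_attributes`, `azure_attributes`, `gcp_attributes`, `availability`) are my understanding of the Clusters API and should be checked against the documentation for your workspace. The function name `force_on_demand` and the example spec are hypothetical.

```python
import copy

# Assumed on-demand values per cloud attribute block (verify against the
# Databricks Clusters API reference for your cloud before relying on them).
ON_DEMAND_VALUES = {
    "aws_attributes": "ON_DEMAND",
    "azure_attributes": "ON_DEMAND_AZURE",
    "gcp_attributes": "ON_DEMAND_GCP",
}


def force_on_demand(cluster_spec: dict) -> dict:
    """Return a copy of cluster_spec with spot/preemptible availability turned off.

    Hypothetical helper: it only rewrites the `availability` field of whichever
    cloud attribute block is already present and leaves everything else alone.
    """
    spec = copy.deepcopy(cluster_spec)  # do not mutate the caller's spec
    for block, on_demand_value in ON_DEMAND_VALUES.items():
        if isinstance(spec.get(block), dict):
            spec[block]["availability"] = on_demand_value
    return spec


# Hypothetical cluster spec that currently allows spot workers.
example_spec = {
    "num_workers": 2,
    "aws_attributes": {"availability": "SPOT_WITH_FALLBACK"},
}

fixed_spec = force_on_demand(example_spec)
print(fixed_spec["aws_attributes"]["availability"])
```

The same change can be made in the cluster UI by turning off the spot option for workers; the sketch is only useful if cluster specs are managed as code.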