<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Distributed Training quits if any worker node fails in Machine Learning</title>
    <link>https://community.databricks.com/t5/machine-learning/distributed-training-quits-if-any-worker-node-fails/m-p/120383#M4090</link>
    <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I'm training a Pytorch model in a distributed environment using the Pytorch's DistributedDataParallel (DDP) library. I have spin up 10 worker nodes.The issue which I'm facing is that during the training, if any worker node fails and exits, the entire notebook execution fails, and i need to start from begining.&lt;/P&gt;&lt;P&gt;I understand that it is a limitation from the DDP for not being able to be fault tolerant. I even tried saving checkpoints, but nothing seems to work.&lt;/P&gt;&lt;P&gt;Is there any alternative where I can continue the training execution even if any worker node fails, and still complete successfully.&lt;/P&gt;&lt;P&gt;I'm even open to work on any other distributed libraries, if this limitation can be overcome.&lt;/P&gt;</description>
    <pubDate>Wed, 28 May 2025 08:13:27 GMT</pubDate>
    <dc:creator>aswinkks</dc:creator>
    <dc:date>2025-05-28T08:13:27Z</dc:date>
    <item>
      <title>Distributed Training quits if any worker node fails</title>
      <link>https://community.databricks.com/t5/machine-learning/distributed-training-quits-if-any-worker-node-fails/m-p/120383#M4090</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I'm training a Pytorch model in a distributed environment using the Pytorch's DistributedDataParallel (DDP) library. I have spin up 10 worker nodes.The issue which I'm facing is that during the training, if any worker node fails and exits, the entire notebook execution fails, and i need to start from begining.&lt;/P&gt;&lt;P&gt;I understand that it is a limitation from the DDP for not being able to be fault tolerant. I even tried saving checkpoints, but nothing seems to work.&lt;/P&gt;&lt;P&gt;Is there any alternative where I can continue the training execution even if any worker node fails, and still complete successfully.&lt;/P&gt;&lt;P&gt;I'm even open to work on any other distributed libraries, if this limitation can be overcome.&lt;/P&gt;</description>
      <pubDate>Wed, 28 May 2025 08:13:27 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/distributed-training-quits-if-any-worker-node-fails/m-p/120383#M4090</guid>
      <dc:creator>aswinkks</dc:creator>
      <dc:date>2025-05-28T08:13:27Z</dc:date>
    </item>
    <item>
      <title>Re: Distributed Training quits if any worker node fails</title>
      <link>https://community.databricks.com/t5/machine-learning/distributed-training-quits-if-any-worker-node-fails/m-p/120965#M4108</link>
      <description>&lt;P&gt;Can you provide more info on why the worker nodes are failing? are you using spot or on-demand instances?&lt;/P&gt;</description>
      <pubDate>Wed, 04 Jun 2025 19:54:30 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/distributed-training-quits-if-any-worker-node-fails/m-p/120965#M4108</guid>
      <dc:creator>rcdatabricks</dc:creator>
      <dc:date>2025-06-04T19:54:30Z</dc:date>
    </item>
    <item>
      <title>Re: Distributed Training quits if any worker node fails</title>
      <link>https://community.databricks.com/t5/machine-learning/distributed-training-quits-if-any-worker-node-fails/m-p/138907#M4437</link>
      <description>&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Distributed training with PyTorch’s DistributedDataParallel (DDP) is not inherently fault-tolerant—if any node fails, the whole job crashes, and, as you noted, checkpointing cannot auto-recover the process without hypervisor or application-level orchestration. However, there are alternative frameworks and design patterns you can adopt to build a more resilient distributed training workflow.&lt;/P&gt;
&lt;H2 id="options-for-fault-tolerant-distributed-training" class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0 md:text-lg [hr+&amp;amp;]:mt-4"&gt;Options for Fault-Tolerant Distributed Training&lt;/H2&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;1.&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;Horovod with Spark&lt;/STRONG&gt;&lt;/H2&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Horovod, originally built for TensorFlow and Keras but well supported for PyTorch, provides tighter Spark integration. Using&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;HorovodRunner&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;in Spark, if a task fails, Spark can retry it based on its job-failure policies. This allows&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;EM&gt;limited&lt;/EM&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;fault tolerance, as executors are restarted up to a threshold, but catastrophic multi-node loss or non-transient failures can still force a full restart.&lt;/P&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;2.&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;TorchElastic/TorchRun (now Torch Distributed Elastic)&lt;/STRONG&gt;&lt;/H2&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;TorchElastic (recently renamed&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;Torch Distributed Elastic&lt;/STRONG&gt;) is designed specifically for elastic and fault-tolerant distributed training with PyTorch. It allows worker nodes to scale up/down during training and can restart failed workers dynamically. You can combine it with job orchestrators like Kubernetes or YARN for robust recovery and auto-scaling. It requires your training code to be "stateless enough" to resume from checkpoints.&lt;/P&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;3.&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;Ray Train / Ray AIR&lt;/STRONG&gt;&lt;/H2&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Ray is a distributed computing framework that supports fault-tolerant training for PyTorch. With Ray Train or Ray AIR libraries, failed training workers can be rescheduled automatically, and training can continue without starting over, provided you have checkpointing configured correctly. Ray’s actor model and robust orchestration give you more resilience than pure DDP.&lt;/P&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;4.&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;Custom Job Orchestration with Checkpointing&lt;/STRONG&gt;&lt;/H2&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Use an external scheduler (Kubernetes, SLURM, YARN) to manage jobs.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Save model, optimizer, and data loader state regularly on robust storage.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;On failure, relaunch training from the latest checkpoint.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;This gives partial recovery but is more complex and less "seamless" than Ray or TorchElastic.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2 id="table-distributed-libraries-with-fault-tolerance" class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0 md:text-lg [hr+&amp;amp;]:mt-4"&gt;Table: Distributed Libraries with Fault Tolerance&lt;/H2&gt;
&lt;DIV class="group relative"&gt;
&lt;DIV class="w-full overflow-x-auto md:max-w-[90vw] border-subtlest ring-subtlest divide-subtlest bg-transparent"&gt;
&lt;TABLE class="border-subtler my-[1em] w-full table-auto border-separate border-spacing-0 border-l border-t"&gt;
&lt;THEAD class="bg-subtler"&gt;
&lt;TR&gt;
&lt;TH class="border-subtler p-sm break-normal border-b border-r text-left align-top"&gt;Library/Framework&lt;/TH&gt;
&lt;TH class="border-subtler p-sm break-normal border-b border-r text-left align-top"&gt;Fault Tolerant?&lt;/TH&gt;
&lt;TH class="border-subtler p-sm break-normal border-b border-r text-left align-top"&gt;Autoscaling/Elasticity&lt;/TH&gt;
&lt;TH class="border-subtler p-sm break-normal border-b border-r text-left align-top"&gt;Easy Spark Integration&lt;/TH&gt;
&lt;TH class="border-subtler p-sm break-normal border-b border-r text-left align-top"&gt;Notes&lt;/TH&gt;
&lt;/TR&gt;
&lt;/THEAD&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;PyTorch DDP&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;No&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;No&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;No&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;Any node failure kills the job&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;Horovod&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;Limited&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;No&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;Yes&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;Spark retries some failures&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;TorchElastic (Distributed Elastic)&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;Yes&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;Yes&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;Indirect&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;Best for Kubernetes/YARN environments&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;Ray Train/AIR&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;Yes&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;Yes&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;Yes&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;Checkpointing and auto-restart supported&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;/DIV&gt;
&lt;DIV class="bg-base border-subtler shadow-subtle pointer-coarse:opacity-100 right-xs absolute bottom-0 flex rounded-lg border opacity-0 transition-opacity group-hover:opacity-100 [&amp;amp;&amp;gt;*:not(:first-child)]:border-subtle [&amp;amp;&amp;gt;*:not(:first-child)]:border-l"&gt;
&lt;DIV class="flex"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;DIV class="flex"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;H2 id="recommendations" class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0 md:text-lg [hr+&amp;amp;]:mt-4"&gt;Recommendations&lt;/H2&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;For maximum Spark integration and some retries:&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;HorovodRunner on Spark&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;– may suffice for small clusters with rare failures.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;For fully robust, elastic, production-grade training:&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;Torch Distributed Elastic&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;(TorchRun) with Kubernetes or YARN.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;For modern, Pythonic, and highly fault-tolerant distributed ML:&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;Ray Train/AIR&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;– supports PyTorch and Spark environments.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Each of these requires code and infra changes, but they provide the continuity and reliability missing in standard DDP setups.&lt;/P&gt;</description>
      <pubDate>Thu, 13 Nov 2025 11:37:55 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/distributed-training-quits-if-any-worker-node-fails/m-p/138907#M4437</guid>
      <dc:creator>mark_ott</dc:creator>
      <dc:date>2025-11-13T11:37:55Z</dc:date>
    </item>
  </channel>
</rss>

