<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Does Databricks supports the Pytorch Distributed Training for multiple devices? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/does-databricks-supports-the-pytorch-distributed-training-for/m-p/100411#M40288</link>
    <description>&lt;P&gt;Hey, so it seems we can't even use TorchDistributor with Distributed Data Parallel (DDP) to get distributed training working in my code, even though `TorchDistributor` is a distribution library written for Spark. With this setup I am not getting the distributed training I expected: the second node (the worker) shows no activity in the cluster metrics. To put the question more concretely: how should we do distributed training in a Databricks multi-node setup that has 1 driver and 1 worker?&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/14792"&gt;@-werners-&lt;/a&gt;&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/34408"&gt;@axb0&lt;/a&gt;&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/80659"&gt;@Smu_Tan&lt;/a&gt;&amp;nbsp;, should we move away from PyTorch entirely for this, rewrite everything in pure Spark code, or is there a dependency that can help with this approach?&lt;/P&gt;</description>
    <pubDate>Fri, 29 Nov 2024 10:04:13 GMT</pubDate>
    <dc:creator>adarsh8304</dc:creator>
    <dc:date>2024-11-29T10:04:13Z</dc:date>
    <item>
      <title>Does Databricks supports the Pytorch Distributed Training for multiple devices?</title>
      <link>https://community.databricks.com/t5/data-engineering/does-databricks-supports-the-pytorch-distributed-training-for/m-p/22908#M15776</link>
      <description>&lt;P&gt;Hi, I'm trying to use the Databricks platform for PyTorch distributed training, but I didn't find any info about this. What I expected is to use multiple clusters to run a common job using PyTorch distributed data parallel (DDP) with the commands below:&lt;/P&gt;&lt;P&gt;On device 1: %sh python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=0 --master_addr="127.0.0.1" --master_port=29500 train_something.py&lt;/P&gt;&lt;P&gt;On device 2: %sh python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=1 --master_addr="127.0.0.1" --master_port=29500 train_something.py&lt;/P&gt;&lt;P&gt;This is definitely supported by other compute platforms like Slurm, but it failed on Databricks. Could you let me know whether you support this, or whether you would consider adding this feature in a later release? Thank you in advance!&lt;/P&gt;</description>
      <pubDate>Wed, 13 Apr 2022 20:24:43 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/does-databricks-supports-the-pytorch-distributed-training-for/m-p/22908#M15776</guid>
      <dc:creator>Smu_Tan</dc:creator>
      <dc:date>2022-04-13T20:24:43Z</dc:date>
    </item>
    <item>
      <title>Re: Does Databricks supports the Pytorch Distributed Training for multiple devices?</title>
      <link>https://community.databricks.com/t5/data-engineering/does-databricks-supports-the-pytorch-distributed-training-for/m-p/22909#M15777</link>
      <description>&lt;P&gt;@Shaomu Tan&amp;nbsp;, can you check &lt;A href="https://pypi.org/project/sparktorch/" alt="https://pypi.org/project/sparktorch/" target="_blank"&gt;sparktorch&lt;/A&gt;?&lt;/P&gt;&lt;P&gt;Parallel processing on Databricks clusters is mainly based on Apache Spark™. So to use that parallel processing, the library in question (PyTorch) has to be written for Spark. sparktorch is an attempt to do just that.&lt;/P&gt;&lt;P&gt;You can also &lt;A href="https://databricks.com/blog/2021/11/19/ray-on-databricks.html" alt="https://databricks.com/blog/2021/11/19/ray-on-databricks.html" target="_blank"&gt;run Ray on Databricks&lt;/A&gt; or Dask (I thought that was possible too), thereby bypassing Apache Spark.&lt;/P&gt;</description>
      <pubDate>Thu, 14 Apr 2022 14:12:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/does-databricks-supports-the-pytorch-distributed-training-for/m-p/22909#M15777</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-04-14T14:12:50Z</dc:date>
    </item>
    <item>
      <title>Re: Does Databricks supports the Pytorch Distributed Training for multiple devices?</title>
      <link>https://community.databricks.com/t5/data-engineering/does-databricks-supports-the-pytorch-distributed-training-for/m-p/22911#M15779</link>
      <description>&lt;P&gt;The Databricks ML Runtime (MLR) provides HorovodRunner, which supports distributed training and inference with PyTorch. Here's an example notebook for your reference: &lt;A href="https://www.databricks.com/notebooks/gallery/PyTorchDistributedDeepLearningTraining.html" alt="https://www.databricks.com/notebooks/gallery/PyTorchDistributedDeepLearningTraining.html" target="_blank"&gt;&lt;I&gt;&lt;U&gt;PyTorchDistributedDeepLearningTraining - Databricks&lt;/U&gt;&lt;/I&gt;&lt;/A&gt;.&lt;/P&gt;</description>
      <pubDate>Sun, 19 Feb 2023 16:15:27 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/does-databricks-supports-the-pytorch-distributed-training-for/m-p/22911#M15779</guid>
      <dc:creator>axb0</dc:creator>
      <dc:date>2023-02-19T16:15:27Z</dc:date>
    </item>
    <item>
      <title>Re: Does Databricks supports the Pytorch Distributed Training for multiple devices?</title>
      <link>https://community.databricks.com/t5/data-engineering/does-databricks-supports-the-pytorch-distributed-training-for/m-p/100411#M40288</link>
      <description>&lt;P&gt;Hey, so it seems we can't even use TorchDistributor with Distributed Data Parallel (DDP) to get distributed training working in my code, even though `TorchDistributor` is a distribution library written for Spark. With this setup I am not getting the distributed training I expected: the second node (the worker) shows no activity in the cluster metrics. To put the question more concretely: how should we do distributed training in a Databricks multi-node setup that has 1 driver and 1 worker?&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/14792"&gt;@-werners-&lt;/a&gt;&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/34408"&gt;@axb0&lt;/a&gt;&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/80659"&gt;@Smu_Tan&lt;/a&gt;&amp;nbsp;, should we move away from PyTorch entirely for this, rewrite everything in pure Spark code, or is there a dependency that can help with this approach?&lt;/P&gt;</description>
      <pubDate>Fri, 29 Nov 2024 10:04:13 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/does-databricks-supports-the-pytorch-distributed-training-for/m-p/100411#M40288</guid>
      <dc:creator>adarsh8304</dc:creator>
      <dc:date>2024-11-29T10:04:13Z</dc:date>
    </item>
    <item>
      <title>Re: Does Databricks supports the Pytorch Distributed Training for multiple devices?</title>
      <link>https://community.databricks.com/t5/data-engineering/does-databricks-supports-the-pytorch-distributed-training-for/m-p/100413#M40290</link>
      <description>&lt;P&gt;Since you replied on a rather old topic: TorchDistributor enables PyTorch on Spark in distributed mode.&lt;BR /&gt;But a cluster with only 1 worker and 1 driver will not run in distributed mode.&lt;BR /&gt;The driver does not execute Spark tasks; it handles Spark overhead and, for example, Python code that runs outside of Spark.&lt;BR /&gt;If you want to run in distributed mode, you should have at least 2 workers (and always a driver).&lt;/P&gt;</description>
      <pubDate>Fri, 29 Nov 2024 10:19:19 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/does-databricks-supports-the-pytorch-distributed-training-for/m-p/100413#M40290</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2024-11-29T10:19:19Z</dc:date>
    </item>
    <item>
      <title>Re: Does Databricks supports the Pytorch Distributed Training for multiple devices?</title>
      <link>https://community.databricks.com/t5/data-engineering/does-databricks-supports-the-pytorch-distributed-training-for/m-p/100856#M40444</link>
      <description>&lt;P&gt;Hey&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/14792"&gt;@-werners-&lt;/a&gt;&amp;nbsp;thanks for answering. First, why do the CPU and memory utilisation metrics then show activity only on the driver, while the worker appears idle, with little sign of any training? With TorchDistributor I think at least that one worker should be in use, right?&lt;BR /&gt;&lt;BR /&gt;One more thing: are the Databricks driver machines designed in a way that makes them less optimal and performant for model training and inference tasks? I ask because Databricks seems to imply that the code should be in Apache Spark only (keeping PyTorch and pandas out of the execution path).&lt;/P&gt;</description>
      <pubDate>Wed, 04 Dec 2024 07:18:51 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/does-databricks-supports-the-pytorch-distributed-training-for/m-p/100856#M40444</guid>
      <dc:creator>adarsh8304</dc:creator>
      <dc:date>2024-12-04T07:18:51Z</dc:date>
    </item>
    <item>
      <title>Re: Does Databricks supports the Pytorch Distributed Training for multiple devices?</title>
      <link>https://community.databricks.com/t5/data-engineering/does-databricks-supports-the-pytorch-distributed-training-for/m-p/100859#M40446</link>
      <description>&lt;P&gt;If only the driver is active, this probably means you are not using Spark. When you run pure Python (or other non-Spark) code, the driver executes it.&lt;BR /&gt;If Spark is active, workers receive their tasks from the driver. Generally the driver is not that active; the workers do all the work. The driver machine is not designed in any special way. You can define yourself what kind of machine you use as a driver.&lt;BR /&gt;You can even run in single node mode, so you only have a driver (which also acts as a worker).&lt;/P&gt;</description>
      <pubDate>Wed, 04 Dec 2024 07:38:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/does-databricks-supports-the-pytorch-distributed-training-for/m-p/100859#M40446</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2024-12-04T07:38:16Z</dc:date>
    </item>
  </channel>
</rss>

