<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Cannot re-initialize CUDA in forked subprocess. in Machine Learning</title>
    <link>https://community.databricks.com/t5/machine-learning/cannot-re-initialize-cuda-in-forked-subprocess/m-p/37513#M1950</link>
    <description>&lt;P&gt;This is the error I am getting: "&lt;SPAN&gt;RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method&lt;/SPAN&gt;". I am using a 13.0nc12s_v3 cluster.&lt;/P&gt;&lt;P&gt;I tried this:&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;import torch.multiprocessing as mp&lt;/DIV&gt;&lt;DIV&gt;mp.set_start_method('spawn', force=True)&lt;/DIV&gt;&lt;BR /&gt;&lt;DIV&gt;from pytorch_lightning.callbacks import EarlyStopping&lt;/DIV&gt;&lt;/DIV&gt;&lt;P&gt;but I am still getting the same error. Is there a solution?&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;</description>
    <pubDate>Wed, 12 Jul 2023 14:51:10 GMT</pubDate>
    <dc:creator>phdykd</dc:creator>
    <dc:date>2023-07-12T14:51:10Z</dc:date>
    <item>
      <title>Cannot re-initialize CUDA in forked subprocess.</title>
      <link>https://community.databricks.com/t5/machine-learning/cannot-re-initialize-cuda-in-forked-subprocess/m-p/37513#M1950</link>
      <description>&lt;P&gt;This is the error I am getting: "&lt;SPAN&gt;RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method&lt;/SPAN&gt;". I am using a 13.0nc12s_v3 cluster.&lt;/P&gt;&lt;P&gt;I tried this:&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;import torch.multiprocessing as mp&lt;/DIV&gt;&lt;DIV&gt;mp.set_start_method('spawn', force=True)&lt;/DIV&gt;&lt;BR /&gt;&lt;DIV&gt;from pytorch_lightning.callbacks import EarlyStopping&lt;/DIV&gt;&lt;/DIV&gt;&lt;P&gt;but I am still getting the same error. Is there a solution?&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;</description>
      <pubDate>Wed, 12 Jul 2023 14:51:10 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/cannot-re-initialize-cuda-in-forked-subprocess/m-p/37513#M1950</guid>
      <dc:creator>phdykd</dc:creator>
      <dc:date>2023-07-12T14:51:10Z</dc:date>
    </item>
    <item>
      <title>Re: Cannot re-initialize CUDA in forked subprocess.</title>
      <link>https://community.databricks.com/t5/machine-learning/cannot-re-initialize-cuda-in-forked-subprocess/m-p/38186#M1983</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/61848"&gt;@phdykd&lt;/a&gt;,&lt;BR /&gt;Thank you for posting your question in the Databricks community.&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;P&gt;One approach is to pass start_method="fork" to the spawn call: mp.spawn(*prev_args, start_method="fork"). This works, but it may raise a warning recommending torch.multiprocessing.start_processes instead (option 2 below).&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;The alternative that PyTorch recommends (link) is to call torch.multiprocessing.start_processes directly: torch.multiprocessing.start_processes(*prev_args, start_method="fork").&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Note that neither of the options above is compatible with CUDA (&lt;A href="https://github.com/Lightning-AI/lightning/blob/7767fd36b68b956ac5f81c713b9384e253f983aa/src/lightning_lite/strategies/launchers/multiprocessing.py#L187" target="_self"&gt;link&lt;/A&gt;, &lt;A href="https://github.com/pytorch/pytorch/blob/main/torch/multiprocessing/spawn.py#L173" target="_self"&gt;link&lt;/A&gt;). Once CUDA has been initialized in the parent process, a forked child cannot re-initialize it, so any .cuda call in the forked processes will fail.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;The option that resolves all of these issues is to use TorchDistributor(local_mode=True).&lt;/P&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;Please refer to this &lt;A href="https://docs.databricks.com/machine-learning/train-model/pytorch.html" target="_self"&gt;Documentation&lt;/A&gt; for more details.&lt;/P&gt;</description>
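      <content:encoded><![CDATA[
As a rough illustration of the TorchDistributor(local_mode=True) suggestion in this reply, here is a minimal sketch. It assumes PySpark 3.4 or later (where pyspark.ml.torch.distributor is available), PyTorch with NCCL, and a GPU node. The names train_fn and launch, the tiny linear model, num_processes=2, and the learning rate are all hypothetical placeholders, not the original poster's code.

```python
def train_fn(learning_rate: float) -> float:
    """Hypothetical training function; runs once in each worker process."""
    # Import inside the function so that each worker process sets up torch
    # and CUDA itself instead of inheriting CUDA state from the notebook.
    import os

    import torch
    import torch.distributed as dist

    # Assumption: the workers are started torchrun-style, so LOCAL_RANK and
    # the rendezvous variables are already present in the environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = torch.device("cuda", local_rank)

    model = torch.nn.Linear(4, 1).to(device)
    model = torch.nn.parallel.DistributedDataParallel(
        model, device_ids=[local_rank]
    )
    optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

    # A single step on random data, only to mark where the real loop goes.
    inputs = torch.randn(8, 4, device=device)
    targets = torch.randn(8, 1, device=device)
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    dist.destroy_process_group()
    return loss.item()


def launch(num_processes: int = 2) -> float:
    """Run train_fn on the driver node's GPUs (local_mode=True)."""
    # Imported lazily so this module loads even where PySpark is missing.
    from pyspark.ml.torch.distributor import TorchDistributor

    distributor = TorchDistributor(
        num_processes=num_processes, local_mode=True, use_gpu=True
    )
    # Positional arguments after the function are forwarded to train_fn.
    return distributor.run(train_fn, 1e-3)
```

In a notebook on a GPU cluster, calling launch() should start the worker processes and return the value produced by train_fn. The training code runs in separately launched processes rather than in a fork of the notebook process, which is why this route is expected to avoid the "Cannot re-initialize CUDA in forked subprocess" error.
]]></content:encoded>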
      <pubDate>Fri, 21 Jul 2023 19:59:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/cannot-re-initialize-cuda-in-forked-subprocess/m-p/38186#M1983</guid>
      <dc:creator>Kumaran</dc:creator>
      <dc:date>2023-07-21T19:59:16Z</dc:date>
    </item>
  </channel>
</rss>

