<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Issue with running multiprocessing on Databricks: Python kernel is unresponsive error in Machine Learning</title>
    <link>https://community.databricks.com/t5/machine-learning/issue-with-running-multiprocessing-on-databricks-python-kernel/m-p/4513#M206</link>
    <description>&lt;P&gt;This is because multiprocessing does not use the distributed framework of Spark/Databricks.&lt;/P&gt;&lt;P&gt;When you use it, your code runs on the driver only and the workers sit idle.&lt;/P&gt;&lt;P&gt;More info &lt;A href="https://stackoverflow.com/questions/68849916/parallelizing-python-code-on-azure-databricks" alt="https://stackoverflow.com/questions/68849916/parallelizing-python-code-on-azure-databricks" target="_blank"&gt;here&lt;/A&gt;.&lt;/P&gt;&lt;P&gt;So you should use a Spark-enabled ML library, such as sparktorch.&lt;/P&gt;&lt;P&gt;Or use Ray instead of Spark, for example:&lt;/P&gt;&lt;P&gt;&lt;A href="https://docs.databricks.com/machine-learning/ray-integration.html" alt="https://docs.databricks.com/machine-learning/ray-integration.html" target="_blank"&gt;https://docs.databricks.com/machine-learning/ray-integration.html&lt;/A&gt;&lt;/P&gt;</description>
    <pubDate>Fri, 12 May 2023 08:26:15 GMT</pubDate>
    <dc:creator>-werners-</dc:creator>
    <dc:date>2023-05-12T08:26:15Z</dc:date>
    <item>
      <title>Issue with running multiprocessing on Databricks: Python kernel is unresponsive error</title>
      <link>https://community.databricks.com/t5/machine-learning/issue-with-running-multiprocessing-on-databricks-python-kernel/m-p/4512#M205</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;My problem:&lt;/B&gt;&lt;/P&gt;&lt;P&gt;I'm trying to run PyTorch code that includes multiprocessing on Databricks, and my code crashes with this message:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Fatal error: The Python kernel is unresponsive.&lt;/P&gt;&lt;P&gt;The Python process exited with exit code 134 (SIGABRT: Aborted).&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Closing down clientserver connection&lt;/P&gt;&lt;P&gt;Assertion failed: ok (src/mailbox.cpp:99)&lt;/P&gt;&lt;P&gt;Fatal Python error: Aborted&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;While debugging, I found that the code crashes when creating a PyTorch DataLoader with num_workers &amp;gt; 0.&lt;/P&gt;&lt;P&gt;With num_workers=0, the code runs fine.&lt;/P&gt;&lt;P&gt;This is the exact point of the crash:&lt;/P&gt;&lt;P&gt;&amp;gt; /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/torch/utils/data/dataloader.py(1042)__init__()&lt;/P&gt;&lt;P&gt;   1040             #     before it starts, and __del__ tries to join but will get:&lt;/P&gt;&lt;P&gt;   1041             #     AssertionError: can only join a started process.&lt;/P&gt;&lt;P&gt;&lt;B&gt;-&amp;gt; 1042             w.start()&lt;/B&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;Other things I tried:&lt;/B&gt;&lt;/P&gt;&lt;P&gt;I tried changing the pin_memory parameter; it crashes with both True and False.&lt;/P&gt;&lt;P&gt;Additionally, elsewhere in the code I have a multiprocessing step that uses Pool:&lt;/P&gt;&lt;P&gt;from multiprocessing.pool import Pool&lt;/P&gt;&lt;P&gt;It also crashes with the message that the Python kernel is unresponsive.&lt;/P&gt;&lt;P&gt;Running this step serially, without the Pool, also seems to solve the issue.&lt;/P&gt;&lt;P&gt;I checked whether this is a memory issue by monitoring the process and saw no problem there. I also tested the code on a larger cluster, and it still crashes.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;Additional information:&lt;/B&gt;&lt;/P&gt;&lt;P&gt;Cluster details: m5.2xlarge, 32 GB memory, 8 cores&lt;/P&gt;&lt;P&gt;Databricks runtime version: 11.2 (includes Apache Spark 3.3.0, Scala 2.12)&lt;/P&gt;&lt;P&gt;Python version: 3.9&lt;/P&gt;&lt;P&gt;PyTorch version: 2.0.1+cu117&lt;/P&gt;&lt;P&gt;(I tried different clusters with more memory, and it happened on all of them.)&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Any help with this would be appreciated &lt;span class="lia-unicode-emoji" title=":folded_hands:"&gt;🙏&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 11 May 2023 14:57:40 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/issue-with-running-multiprocessing-on-databricks-python-kernel/m-p/4512#M205</guid>
      <dc:creator>956020</dc:creator>
      <dc:date>2023-05-11T14:57:40Z</dc:date>
    </item>
    <item>
      <title>Re: Issue with running multiprocessing on Databricks: Python kernel is unresponsive error</title>
      <link>https://community.databricks.com/t5/machine-learning/issue-with-running-multiprocessing-on-databricks-python-kernel/m-p/4513#M206</link>
      <description>&lt;P&gt;This is because multiprocessing does not use the distributed framework of Spark/Databricks.&lt;/P&gt;&lt;P&gt;When you use it, your code runs on the driver only and the workers sit idle.&lt;/P&gt;&lt;P&gt;More info &lt;A href="https://stackoverflow.com/questions/68849916/parallelizing-python-code-on-azure-databricks" alt="https://stackoverflow.com/questions/68849916/parallelizing-python-code-on-azure-databricks" target="_blank"&gt;here&lt;/A&gt;.&lt;/P&gt;&lt;P&gt;So you should use a Spark-enabled ML library, such as sparktorch.&lt;/P&gt;&lt;P&gt;Or use Ray instead of Spark, for example:&lt;/P&gt;&lt;P&gt;&lt;A href="https://docs.databricks.com/machine-learning/ray-integration.html" alt="https://docs.databricks.com/machine-learning/ray-integration.html" target="_blank"&gt;https://docs.databricks.com/machine-learning/ray-integration.html&lt;/A&gt;&lt;/P&gt;</description>
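      <content:encoded><![CDATA[
<P>To illustrate the point that multiprocessing stays on the driver, here is a minimal sketch using only the Python standard library. It is not taken from the thread: the helper names where_am_i and run_on_local_pool are hypothetical, and it assumes the script is run as a normal Python program on one machine. Every task reports the hostname of the process that handled it, and all of them report the same host.</P>

```python
import multiprocessing as mp
import os
import socket


def where_am_i(task_id):
    """Return the task id plus the host and process id that handled it.

    Hypothetical helper for illustration only.
    """
    return task_id, socket.gethostname(), os.getpid()


def run_on_local_pool(n_tasks=8, n_procs=2):
    """Run where_am_i for each task id in a local process pool.

    A Pool only starts processes on the machine that runs this code
    (on Databricks, that would be the driver node); it has no way to
    reach the Spark workers.
    """
    with mp.Pool(processes=n_procs) as pool:
        return pool.map(where_am_i, range(n_tasks))


if __name__ == "__main__":
    results = run_on_local_pool()
    hosts = {host for _, host, _ in results}
    pids = {pid for _, _, pid in results}
    # All tasks ran on one host, spread over at most n_procs processes.
    print(f"hosts used: {hosts}")
    print(f"worker processes used: {len(pids)}")
```

<P>Adding nodes to the cluster would not change the set of hosts printed here, because the pool never leaves the machine it was started on. Spreading the same function across workers would need a Spark-based or Ray-based API, as suggested above.</P>
]]></content:encoded>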
      <pubDate>Fri, 12 May 2023 08:26:15 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/issue-with-running-multiprocessing-on-databricks-python-kernel/m-p/4513#M206</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2023-05-12T08:26:15Z</dc:date>
    </item>
  </channel>
</rss>

