<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Horovod Databricks Job - custom module not found error in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/horovod-databricks-job-custom-module-not-found-error/m-p/24394#M16940</link>
    <description>&lt;P&gt;We followed this example to create a distributed deep learning training notebook, which works as expected: &lt;A href="https://www.databricks.com/blog/2022/09/07/accelerating-your-deep-learning-pytorch-lightning-databricks.html" target="test_blank"&gt;https://www.databricks.com/blog/2022/09/07/accelerating-your-deep-learning-pytorch-lightning-databricks.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;We now want to run this notebook as a task in the Job Compute Workflow, which runs essentially the same code but as a Databricks job. Surprisingly, this fails with the following error:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;INFO:HorovodRunner:Start training.
Warning: Permanently added '172.17.131.218' (ECDSA) to the list of known hosts.
Warning: Permanently added '172.17.162.215' (ECDSA) to the list of known hosts.
[1,1]&amp;lt;stderr&amp;gt;:Traceback (most recent call last):
[1,1]&amp;lt;stderr&amp;gt;:  File "&amp;lt;string&amp;gt;", line 1, in &amp;lt;module&amp;gt;
[1,1]&amp;lt;stderr&amp;gt;:ModuleNotFoundError: No module named 'training'&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Here, &lt;B&gt;training&lt;/B&gt; is a small Python module file in the same folder that contains reusable library functions. My guess is that the top-level import code in the notebook is executed on a worker node, which may not have that file. But I am confused about why this is happening:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Shouldn't Horovod simply pass to the workers the functions already loaded in the environment, specifically those provided in the call to HorovodRunner.run?&lt;/LI&gt;&lt;LI&gt;Why don't we see this error on an interactive cluster that runs the same notebook?&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;Thanks for your help.&lt;/P&gt;</description>
    <pubDate>Tue, 01 Nov 2022 17:13:46 GMT</pubDate>
    <dc:creator>Serhii</dc:creator>
    <dc:date>2022-11-01T17:13:46Z</dc:date>
    <item>
      <title>Horovod Databricks Job - custom module not found error</title>
      <link>https://community.databricks.com/t5/data-engineering/horovod-databricks-job-custom-module-not-found-error/m-p/24394#M16940</link>
      <description>&lt;P&gt;We followed this example to create a distributed deep learning training notebook, which works as expected: &lt;A href="https://www.databricks.com/blog/2022/09/07/accelerating-your-deep-learning-pytorch-lightning-databricks.html" target="test_blank"&gt;https://www.databricks.com/blog/2022/09/07/accelerating-your-deep-learning-pytorch-lightning-databricks.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;We now want to run this notebook as a task in the Job Compute Workflow, which runs essentially the same code but as a Databricks job. Surprisingly, this fails with the following error:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;INFO:HorovodRunner:Start training.
Warning: Permanently added '172.17.131.218' (ECDSA) to the list of known hosts.
Warning: Permanently added '172.17.162.215' (ECDSA) to the list of known hosts.
[1,1]&amp;lt;stderr&amp;gt;:Traceback (most recent call last):
[1,1]&amp;lt;stderr&amp;gt;:  File "&amp;lt;string&amp;gt;", line 1, in &amp;lt;module&amp;gt;
[1,1]&amp;lt;stderr&amp;gt;:ModuleNotFoundError: No module named 'training'&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Here, &lt;B&gt;training&lt;/B&gt; is a small Python module file in the same folder that contains reusable library functions. My guess is that the top-level import code in the notebook is executed on a worker node, which may not have that file. But I am confused about why this is happening:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Shouldn't Horovod simply pass to the workers the functions already loaded in the environment, specifically those provided in the call to HorovodRunner.run?&lt;/LI&gt;&lt;LI&gt;Why don't we see this error on an interactive cluster that runs the same notebook?&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;Thanks for your help.&lt;/P&gt;</description>
      <pubDate>Tue, 01 Nov 2022 17:13:46 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/horovod-databricks-job-custom-module-not-found-error/m-p/24394#M16940</guid>
      <dc:creator>Serhii</dc:creator>
      <dc:date>2022-11-01T17:13:46Z</dc:date>
    </item>
  </channel>
</rss>

