We followed this example to create a distributed deep learning training notebook, https://www.databricks.com/blog/2022/09/07/accelerating-your-deep-learning-pytorch-lightning-databri..., and it works as expected.
We now want to run the same notebook as a Workflow task on Job Compute, which is essentially the same code but executed as a Databricks job. Surprisingly, this gives us the following error:
INFO:HorovodRunner:Start training.
Warning: Permanently added '172.17.131.218' (ECDSA) to the list of known hosts.
Warning: Permanently added '172.17.162.215' (ECDSA) to the list of known hosts.
[1,1]<stderr>:Traceback (most recent call last):
[1,1]<stderr>: File "<string>", line 1, in <module>
[1,1]<stderr>:ModuleNotFoundError: No module named 'training'
The training module referenced here is a small Python module file in the same folder that contains reusable library functions. My guess is that the top-level import code in the notebook is also executed on the worker nodes, which may not have that file. But I am confused about why this is happening (a rough sketch of the notebook layout follows the questions below):
- Shouldn't Horovod just pass to the workers the function that is already loaded in the driver environment and explicitly provided in the call to HorovodRunner.run?
- Why don't we see this error on an interactive cluster that runs the same notebook?
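
For reference, the setup is roughly as sketched below (folder layout, module contents, and function names are illustrative, not our exact code):

```python
# Repo folder layout (simplified):
#   my_project/
#     <this notebook>
#     training.py      <- small module with reusable library functions
#
# Notebook code:
from sparkdl import HorovodRunner

import training  # sibling module in the same folder; this is the import that fails on the workers


def main_fn():
    # The actual training loop lives in training.py and is just called from here.
    return training.train_and_evaluate()  # illustrative function name


# Run distributed training; HorovodRunner serializes main_fn and executes it on the worker nodes.
hr = HorovodRunner(np=2)
hr.run(main_fn)
```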
Thanks for your help