Horovod Databricks Job - custom module not found error

Serhii
Contributor

We used the following example to create a distributed deep learning training notebook, and it works as expected: https://www.databricks.com/blog/2022/09/07/accelerating-your-deep-learning-pytorch-lightning-databri...

We now want to run this notebook as a task in a Workflow on job compute, which essentially runs the same code as a Databricks job. Surprisingly, this gives us the following error:

INFO:HorovodRunner:Start training.
Warning: Permanently added '172.17.131.218' (ECDSA) to the list of known hosts.
Warning: Permanently added '172.17.162.215' (ECDSA) to the list of known hosts.
[1,1]<stderr>:Traceback (most recent call last):
[1,1]<stderr>:  File "<string>", line 1, in <module>
[1,1]<stderr>:ModuleNotFoundError: No module named 'training'

Here, training is a small Python module file in the same folder as the notebook; it contains reusable library functions. My guess is that the notebook's top-level import is executed on the worker nodes, which may not have that file. But I am confused about why this is happening:

  1. Shouldn't Horovod simply pass to the workers the functions that are already loaded in the environment and explicitly provided in the call to HorovodRunner.run?
  2. Why don't we see this error on an interactive cluster that runs the same notebook?
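
One way to test the guess above outside Databricks is a minimal sketch that uses only the Python standard library. It shows that a function defined in a separate module is pickled by reference (module name plus function name), so the process that unpickles it must be able to import that module. The module name training mirrors the one in the error; the function train_step and the temporary directory are hypothetical. Whether HorovodRunner serializes functions in exactly this way is an assumption and is not confirmed anywhere in this thread.

```python
import importlib
import os
import pickle
import sys
import tempfile

# Write a tiny stand-in for the helper module (hypothetical contents).
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, "training.py"), "w") as f:
    f.write("def train_step(x):\n    return x * 2\n")

# "Driver" side: the module is importable, so pickling succeeds.
sys.path.insert(0, module_dir)
importlib.invalidate_caches()
import training

payload = pickle.dumps(training.train_step)
# The payload stores a reference to training.train_step, not its code.
assert b"training" in payload

# Simulate a "worker" that does not have training.py on its import path.
sys.path.remove(module_dir)
del sys.modules["training"]
importlib.invalidate_caches()

try:
    pickle.loads(payload)
    worker_error = None
except ModuleNotFoundError as exc:
    worker_error = str(exc)

print(worker_error)  # No module named 'training'

# Making the module's directory importable again lets the load succeed.
sys.path.insert(0, module_dir)
importlib.invalidate_caches()
restored = pickle.loads(payload)
print(restored(21))  # 42
```

If the same mechanism applies here, the job's workers would need the folder containing training.py on their import path, or the helper functions would have to live somewhere importable on every node.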

Thanks for your help

1 ACCEPTED SOLUTION

Kaniz
Community Manager

Hi @Sergii Ivakhno, what DBR version are you running for this notebook?

Note: Make sure the cluster is created with Databricks Runtime ML and is attached to this notebook. (You cannot run this exercise on the standard Databricks Runtime without "ML".)

The ML runtime is optimized for deep learning, and all related components (TensorFlow, Horovod, Keras, XGBoost, etc.) are already built in, so you don’t need to install them yourself.

The built-in HorovodRunner on the ML runtime lets Horovod run on Apache Spark™. (Horovod, originally developed at Uber, exchanges gradients between workers efficiently, which helps training scale.)


2 REPLIES

Kaniz
Community Manager

Hi @Sergii Ivakhno, we haven’t heard from you since my last response, and I am checking back to see whether my suggestions helped.

If you have found a solution, please share it with the community, as it can be helpful to others.

Also, please don't forget to click the "Select As Best" button whenever the information provided helps resolve your question.
