Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.

TorchDistributor: installation of custom python package via wheel across all nodes in cluster

tooooods
New Contributor

I am trying to set up a distributed PyTorch training pipeline using TorchDistributor. I have defined a train_object (in my case a Callable) that runs my training code. However, this function depends on custom modules that I have written myself. I've packaged this code into a wheel file and can install it via the Libraries API: the POST returns a 200, the library shows as installed in my cluster's Libraries tab (picture attached), and the `/api/2.0/libraries/cluster-status` endpoint also confirms the installation.
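For reference, the install request and status check described above can be sketched roughly as follows. This is a minimal illustration, not the exact code used in the post: the helper names, the cluster ID, and the wheel path are hypothetical placeholders, and the response shape assumed in `wheel_installed` (a `library_statuses` list whose entries carry a `library` and a `status`) should be checked against the Libraries API documentation.

```python
import json


def build_install_payload(cluster_id, wheel_path):
    """Build a JSON body for POST /api/2.0/libraries/install (hypothetical helper)."""
    return {"cluster_id": cluster_id, "libraries": [{"whl": wheel_path}]}


def wheel_installed(status_response, wheel_path):
    """Return True if a cluster-status response lists the wheel as INSTALLED.

    Assumption: the response contains a `library_statuses` list.
    """
    for entry in status_response.get("library_statuses", []):
        same_wheel = entry.get("library", {}).get("whl") == wheel_path
        if same_wheel and entry.get("status") == "INSTALLED":
            return True
    return False


# Placeholder values for illustration only.
payload = build_install_payload("<cluster-id>", "dbfs:/path/to/my_package.whl")
print(json.dumps(payload))
```

Note that a status of INSTALLED here only reflects what the API reports for the cluster; it does not by itself show that the Python process launched on each worker can import the package.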

However, when I start a TorchDistributor run, I get `ModuleNotFoundError: No module named '<my_module>'`. I've tried both relative and absolute imports to access my modules. I have also checked the site-packages/ and dist-packages/ directories on the workers, and my module does not appear to be installed there.
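A check along these lines can help narrow down where the module is visible. It is only a diagnostic sketch using the Python standard library: `module_report` is a hypothetical helper, and the idea of calling it at the top of the training function (so that each process reports its own host and interpreter) is an assumption about how one might use it, not part of TorchDistributor.

```python
import importlib.util
import socket
import sys


def module_report(module_name):
    """Report whether `module_name` is importable in the current process."""
    try:
        spec = importlib.util.find_spec(module_name)
    except ModuleNotFoundError:
        # Raised for dotted names when the parent package is missing.
        spec = None
    return {
        "host": socket.gethostname(),   # which node this process runs on
        "python": sys.executable,       # which interpreter is in use
        "found": spec is not None,
        "origin": getattr(spec, "origin", None),  # file path if found
    }


# Example with a standard-library module; replace with the custom package name.
print(module_report("json"))
```

Comparing the `python` path and `found` flag between the driver and each worker process shows whether the wheel landed in a different environment from the one the training processes actually use.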

Am I doing something wrong here? How can I make this custom code available across all workers in my cluster?

Thanks! 

0 REPLIES 0
