TorchDistributor: installing a custom Python package via wheel across all nodes in a cluster
02-05-2025 02:38 PM
I am trying to set up a training pipeline for a distributed PyTorch model using TorchDistributor. I have defined a train_object (in my case, a Callable) that runs my training code. However, this callable depends on custom modules that I have written myself. I've packaged this code into a wheel file and installed it via the Libraries API: the POST returns a 200, the wheel shows as installed in my cluster's Libraries tab (picture attached), and the `/api/2.0/libraries/cluster-status` endpoint also confirms the installation.
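For illustration, the install request and status check described above could be sketched roughly as follows. This is not the exact code used here: the cluster ID and wheel path are hypothetical placeholders, the response dictionary is hand-written to mirror what the cluster-status endpoint is understood to return, and no HTTP request is actually sent.

```python
import json

# Hypothetical placeholders, not values from the original post.
CLUSTER_ID = "0123-456789-abcdefgh"
WHEEL_PATH = "dbfs:/FileStore/wheels/my_module-0.1.0-py3-none-any.whl"


def build_install_payload(cluster_id, wheel_path):
    """Build the JSON body for POST /api/2.0/libraries/install."""
    return {"cluster_id": cluster_id, "libraries": [{"whl": wheel_path}]}


def wheel_status(cluster_status, wheel_path):
    """Return the status reported for wheel_path, or None if it is not listed.

    cluster_status is the parsed JSON from
    GET /api/2.0/libraries/cluster-status (shape assumed, see sample below).
    """
    for entry in cluster_status.get("library_statuses", []):
        if entry.get("library", {}).get("whl") == wheel_path:
            return entry.get("status")
    return None


# Hand-written sample response for illustration only (assumed shape).
sample_response = {
    "cluster_id": CLUSTER_ID,
    "library_statuses": [
        {"library": {"whl": WHEEL_PATH}, "status": "INSTALLED"},
    ],
}

print(json.dumps(build_install_payload(CLUSTER_ID, WHEEL_PATH)))
print(wheel_status(sample_response, WHEEL_PATH))
```

Note that a 200 from the install call only means the request was accepted; the per-library status from cluster-status is the part that reports whether installation finished.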
However, when I initiate a TorchDistributor run, I get `ModuleNotFoundError: No module named '<my_module>'`. I've tried both relative and absolute imports to access my modules. I have also checked the site-packages/ and dist-packages/ directories on the workers, and my module is indeed not installed there.
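One way to narrow this down is to check importability from inside the training process itself rather than by browsing directories. The sketch below is only a diagnostic idea, not a fix: `describe_module` is a hypothetical helper name, and it uses only the standard library to report which host and Python interpreter the code is running under and whether a given module can be located there.

```python
import importlib.util
import socket
import sys


def describe_module(name):
    """Report whether `name` is importable in the current interpreter."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        # Raised when a parent package is missing or a spec is malformed.
        spec = None
    return {
        "host": socket.gethostname(),      # which node this ran on
        "python": sys.executable,          # which interpreter was used
        "found": spec is not None,         # True if the module can be located
        "origin": getattr(spec, "origin", None),  # file it would load from
    }


# "json" is a stdlib module used here only to demonstrate the output shape.
print(describe_module("json"))
```

Calling something like this at the top of the function passed to TorchDistributor, with the custom module's name in place of `"json"`, would show whether the interpreter used by each training process is the same environment the wheel was installed into.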
Am I doing something wrong here? How can I make this custom code available across all workers in my cluster?
Thanks!

