Hi @tooooods,
This is a classic challenge in distributed computing, and your observation is spot on.
The ModuleNotFoundError on the workers, despite the UI and API showing the library as "Installed," is the key symptom. This happens because TorchDistributor launches new Python processes on the worker nodes, and those processes need to be able to find and import your custom module from their own environment.
The cluster's "Libraries" UI/API status often reflects the state of the driver node or the cluster's intended configuration, but it doesn't always guarantee immediate, successful installation across all worker filesystems, especially for libraries added to a running cluster.
Here are the two most reliable ways to solve this, in order of how strongly I recommend them:
Solution 1: Use a Cluster-Scoped Init Script (Most Robust)
This is the most reliable method to ensure your package is installed on every node (driver and all workers) before any other process starts.
- Upload Your Wheel: Make sure your .whl file is in a location accessible to the cluster, such as DBFS. For example: dbfs:/FileStore/my_libs/my_module-0.1.0-py3-none-any.whl.
- Create the Init Script: Write a simple shell script, e.g. install-my-module.sh, that pip-installs the wheel (a sketch follows after these steps).
- Upload the Init Script: Save the script and upload it to DBFS (e.g., dbfs:/databricks/init_scripts/install-my-module.sh).
- Configure the Cluster:
  - Go to your cluster's configuration page and click Edit.
  - Go to Advanced Options and click the Init Scripts tab.
  - In the "Destination" dropdown, select DBFS.
  - Provide the path to your script: dbfs:/databricks/init_scripts/install-my-module.sh.
  - Click Add.
- Restart the Cluster: You must restart the cluster for the init script to take effect. On restart, the script runs on every node, guaranteeing your module is in the Python environment before TorchDistributor tries to use it.
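For reference, here is a minimal sketch of what the init script could look like. The wheel path and filenames are the placeholder examples from the steps above, and it assumes the standard /databricks/python/bin/pip interpreter path on Databricks Runtime (DBFS is mounted locally on every node at /dbfs):

```
#!/bin/bash
# install-my-module.sh: runs on every node (driver and workers) at cluster startup.
set -e

# Install the custom wheel into the cluster's Python environment.
# /dbfs is the local mount of DBFS on each node; adjust the path to match your upload location.
/databricks/python/bin/pip install /dbfs/FileStore/my_libs/my_module-0.1.0-py3-none-any.whl
```

Uploading the wheel and the script can be done from your machine with the Databricks CLI, assuming it is installed and configured:

```
databricks fs cp my_module-0.1.0-py3-none-any.whl dbfs:/FileStore/my_libs/my_module-0.1.0-py3-none-any.whl
databricks fs cp install-my-module.sh dbfs:/databricks/init_scripts/install-my-module.sh
```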
Solution 2: Install as a Cluster Library (The "Intended" Way)
This is what you tried, but the key is to ensure it's done as part of the cluster's permanent configuration and that the cluster is restarted afterward. Installing a library via the API to an already running cluster can be unreliable for worker propagation.
- Go to your cluster's configuration page.
- Click the Libraries tab.
- Click Install New.
- For "Library Source," select DBFS/S3 (or "Upload" to upload the wheel directly).
- Provide the full path to your .whl file (e.g., dbfs:/FileStore/my_libs/my_module-0.1.0-py3-none-any.whl).
- Click Install.
- The UI will show the library as "Installing" and will likely prompt you to restart the cluster. Do this.
On restart, Databricks should then distribute and install the wheel on all worker nodes for you. If this still fails, fall back to Solution 1, which is more explicit and bypasses any potential propagation delays.
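For completeness, if you would rather drive Solution 2 through the Libraries API (as you were doing) instead of the UI, the part that actually matters is restarting the cluster after the install request. Here is a minimal sketch, assuming the standard /api/2.0/libraries/install and /api/2.0/clusters/restart endpoints, with placeholder workspace URL, token, and cluster ID:

```
# Attach the wheel as a cluster library (placeholders: workspace URL, token, cluster ID).
curl -X POST https://<your-workspace>/api/2.0/libraries/install \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
        "cluster_id": "1234-567890-abcde123",
        "libraries": [
          {"whl": "dbfs:/FileStore/my_libs/my_module-0.1.0-py3-none-any.whl"}
        ]
      }'

# Then restart the cluster so the library is actually installed on every node,
# not just registered in the cluster configuration.
curl -X POST https://<your-workspace>/api/2.0/clusters/restart \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"cluster_id": "1234-567890-abcde123"}'
```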
The Anti-Pattern: What Not to Do
Just for clarity, do not use %pip install in a notebook cell.
This installs the library only on the driver node and only for the current notebook session. The separate worker processes that TorchDistributor launches have no knowledge of it, which produces exactly the ModuleNotFoundError you are seeing.
I recommend trying the Cluster Init Script method first, as it's the most dependable solution for custom code in distributed workloads.