Hi @tooooods,
This is a classic challenge in distributed computing, and your observation is spot on.
The ModuleNotFoundError on the workers, despite the UI and API showing the library as "Installed," is the key symptom. This happens because TorchDistributor launches new Python processes on the worker nodes, and those processes need to be able to find and import your custom module from their own environment.
The cluster's "Libraries" UI/API status often reflects the state of the driver node or the cluster's intended configuration, but it doesn't always guarantee immediate, successful installation across all worker filesystems, especially for libraries added to a running cluster.
Here are the two most reliable ways to solve this, in order of how strongly I recommend them:
Solution 1: Use a Cluster-Scoped Init Script (Most Robust)
This is the most reliable method to ensure your package is installed on every node (driver and all workers) before any other process starts.
- Upload Your Wheel: Make sure your .whl file is in a location accessible to the cluster, such as DBFS. For example: dbfs:/FileStore/my_libs/my_module-0.1.0-py3-none-any.whl.
- Create the Init Script: Write a simple shell script, e.g. install-my-module.sh, that pip-installs the wheel (a sketch follows after these steps).
- Upload the Init Script: Save the script and upload it to DBFS (e.g., dbfs:/databricks/init_scripts/install-my-module.sh).
- Configure the Cluster:
  - Go to your cluster's configuration page and click Edit.
  - Go to Advanced Options and click the Init Scripts tab.
  - In the "Destination" dropdown, select DBFS.
  - Provide the path to your script: dbfs:/databricks/init_scripts/install-my-module.sh.
  - Click Add.
- Restart the Cluster: You must restart the cluster for the init script to take effect. On restart, the script runs on every node, guaranteeing your module is in the Python environment before TorchDistributor tries to use it.
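For reference, here is a minimal sketch of what the init script could look like. The wheel path and filenames are the placeholder examples from the steps above, and it assumes the standard /databricks/python/bin/pip interpreter path on Databricks Runtime (DBFS is mounted locally on every node at /dbfs):

```
#!/bin/bash
# install-my-module.sh: runs on every node (driver and workers) at cluster startup.
set -e

# Install the custom wheel into the cluster's Python environment.
# /dbfs is the local mount of DBFS on each node; adjust the path to match your upload location.
/databricks/python/bin/pip install /dbfs/FileStore/my_libs/my_module-0.1.0-py3-none-any.whl
```

Uploading the wheel and the script can be done from your machine with the Databricks CLI, assuming it is installed and configured:

```
databricks fs cp my_module-0.1.0-py3-none-any.whl dbfs:/FileStore/my_libs/my_module-0.1.0-py3-none-any.whl
databricks fs cp install-my-module.sh dbfs:/databricks/init_scripts/install-my-module.sh
```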
Solution 2: Install as a Cluster Library (The "Intended" Way)
This is what you tried, but the key is to ensure it's done as part of the cluster's permanent configuration and that the cluster is restarted afterward. Installing a library via the API to an already running cluster can be unreliable for worker propagation.
- Go to your cluster's configuration page.
- Click the Libraries tab.
- Click Install New.
- For "Library Source," select DBFS/S3 (or "Upload" to upload the wheel directly).
- Provide the full path to your .whl file (e.g., dbfs:/FileStore/my_libs/my_module-0.1.0-py3-none-any.whl).
- Click Install.
- The UI will show the library as "Installing" and will likely prompt you to restart the cluster. Do this.
On restart, Databricks should then distribute and install the wheel on all worker nodes for you. If this still fails, fall back to Solution 1, which is more explicit and bypasses any potential propagation delays.
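For completeness, if you would rather drive Solution 2 through the Libraries API (as you were doing) instead of the UI, the part that actually matters is restarting the cluster after the install request. Here is a minimal sketch, assuming the standard /api/2.0/libraries/install and /api/2.0/clusters/restart endpoints, with placeholder workspace URL, token, and cluster ID:

```
# Attach the wheel as a cluster library (placeholders: workspace URL, token, cluster ID).
curl -X POST https://<your-workspace>/api/2.0/libraries/install \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
        "cluster_id": "1234-567890-abcde123",
        "libraries": [
          {"whl": "dbfs:/FileStore/my_libs/my_module-0.1.0-py3-none-any.whl"}
        ]
      }'

# Then restart the cluster so the library is actually installed on every node,
# not just registered in the cluster configuration.
curl -X POST https://<your-workspace>/api/2.0/clusters/restart \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"cluster_id": "1234-567890-abcde123"}'
```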
The Anti-Pattern: What Not to Do
Just for clarity, do not use %pip install in a notebook cell.
This installs the library only on the driver node and only for the current notebook session. The separate worker processes that TorchDistributor launches have no knowledge of it, which produces exactly the ModuleNotFoundError you are seeing.
I recommend trying the Cluster Init Script method first, as it's the most dependable solution for custom code in distributed workloads.