<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: TorchDistributor: installation of custom python package via wheel across all nodes in cluster in Machine Learning</title>
    <link>https://community.databricks.com/t5/machine-learning/torchdistributor-installation-of-custom-python-package-via-wheel/m-p/138026#M4415</link>
    <description></description>
    <pubDate>Thu, 06 Nov 2025 19:04:00 GMT</pubDate>
    <dc:creator>stbjelcevic</dc:creator>
    <dc:date>2025-11-06T19:04:00Z</dc:date>
    <item>
      <title>TorchDistributor: installation of custom python package via wheel across all nodes in cluster</title>
      <link>https://community.databricks.com/t5/machine-learning/torchdistributor-installation-of-custom-python-package-via-wheel/m-p/109029#M3952</link>
      <description>&lt;P&gt;I am trying to set up a training pipeline of a distributed PyTorch model using TorchDistributor. I have defined a train_object (in my case it is a Callable) that runs my training code. However, this method requires custom code from modules that I have written myself. I've packaged this code up into a wheel file and can install it via the &lt;A href="https://docs.databricks.com/api/workspace/libraries/install" target="_self"&gt;Libraries API&lt;/A&gt;. I get a 200 code back from the POST, see that this has been successfully installed in my cluster's libraries tab (picture attached), and can also confirm installation via the `&lt;SPAN&gt;/api/2.0/libraries/cluster-status` endpoint.&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;However, when I initiate a TorchDistributor run, I get `ModuleNotFoundError: No module named '&amp;lt;my_module&amp;gt;'`. I've tried using both relative and absolute imports to access my modules. I have also checked the site-packages/ and dist-packages/ directories in the workers and indeed my module doesn't seem to be installed there.&lt;BR /&gt;&lt;BR /&gt;Am I doing something wrong here? How can I make this custom code available across all workers in my cluster?&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Thanks!&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 05 Feb 2025 22:38:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/torchdistributor-installation-of-custom-python-package-via-wheel/m-p/109029#M3952</guid>
      <dc:creator>tooooods</dc:creator>
      <dc:date>2025-02-05T22:38:29Z</dc:date>
    </item>
    <item>
      <title>Re: TorchDistributor: installation of custom python package via wheel across all nodes in cluster</title>
      <link>https://community.databricks.com/t5/machine-learning/torchdistributor-installation-of-custom-python-package-via-wheel/m-p/138026#M4415</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/148015"&gt;@tooooods&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;This is a classic challenge in distributed computing, and your observation is spot on.&lt;/P&gt;
&lt;P&gt;The &lt;CODE&gt;ModuleNotFoundError&lt;/CODE&gt; on the workers, despite the UI and API showing the library as "Installed," is the key symptom. This happens because &lt;CODE&gt;TorchDistributor&lt;/CODE&gt; launches new Python processes on the worker nodes, and those processes need to be able to find and import your custom module from their own environment.&lt;/P&gt;
&lt;P&gt;The cluster's "Libraries" UI/API status often reflects the state of the driver node or the cluster's intended configuration, but it doesn't always guarantee immediate, successful installation across all worker filesystems, especially for libraries added to a running cluster.&lt;/P&gt;
&lt;P&gt;Here are the two most reliable ways to solve this, from most recommended to "should-work":&lt;/P&gt;
&lt;H3&gt;Solution 1: Use a Cluster-Scoped Init Script (Most Robust)&lt;/H3&gt;
&lt;P&gt;This is the most reliable method to ensure your package is installed on &lt;STRONG&gt;every&lt;/STRONG&gt; node (driver and all workers): an init script runs during each node's startup, &lt;I&gt;before&lt;/I&gt; the Spark driver or worker JVM starts.&lt;/P&gt;
&lt;OL start="1"&gt;
&lt;LI&gt;
&lt;P&gt;&lt;STRONG&gt;Upload Your Wheel:&lt;/STRONG&gt; Make sure your &lt;CODE&gt;.whl&lt;/CODE&gt; file is in a location accessible to the cluster, like DBFS. For example: &lt;CODE&gt;dbfs:/FileStore/my_libs/my_module-0.1.0-py3-none-any.whl&lt;/CODE&gt;.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;&lt;STRONG&gt;Create the Init Script:&lt;/STRONG&gt; Create a simple shell script. Let's call it &lt;CODE&gt;install-my-module.sh&lt;/CODE&gt;.&lt;/P&gt;
&lt;DIV class="code-block ng-tns-c2373308345-46 ng-animate-disabled ng-trigger ng-trigger-codeBlockRevealAnimation"&gt;
&lt;DIV class="code-block-decoration header-formatted gds-title-s ng-tns-c2373308345-46 ng-star-inserted"&gt;&lt;SPAN class="ng-tns-c2373308345-46"&gt;Bash&lt;/SPAN&gt;
&lt;DIV class="buttons ng-tns-c2373308345-46 ng-star-inserted"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;DIV class="formatted-code-block-internal-container ng-tns-c2373308345-46"&gt;
&lt;DIV class="animated-opacity ng-tns-c2373308345-46"&gt;
&lt;PRE class="ng-tns-c2373308345-46"&gt;&lt;CODE class="code-container formatted ng-tns-c2373308345-46" role="text" data-test-id="code-content"&gt;&lt;SPAN class="hljs-meta"&gt;#!/bin/bash&lt;/SPAN&gt;
&lt;SPAN class="hljs-comment"&gt;# Use -e to exit immediately if pip fails&lt;/SPAN&gt;
&lt;SPAN class="hljs-built_in"&gt;set&lt;/SPAN&gt; -e

&lt;SPAN class="hljs-comment"&gt;# Install the wheel from DBFS&lt;/SPAN&gt;
&lt;SPAN class="hljs-comment"&gt;# Add --upgrade to ensure it installs the correct version&lt;/SPAN&gt;
pip install /dbfs/FileStore/my_libs/my_module-0.1.0-py3-none-any.whl --upgrade
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
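&lt;P&gt;&lt;STRONG&gt;Upload the Init Script:&lt;/STRONG&gt; Save this script and upload it to DBFS (e.g., &lt;CODE&gt;dbfs:/databricks/init_scripts/install-my-module.sh&lt;/CODE&gt;). Init scripts stored on DBFS are a legacy option that newer workspaces may reject; if yours does, store the script as a workspace file or in a Unity Catalog volume instead and choose the matching option in the next step.&lt;/P&gt;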
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;&lt;STRONG&gt;Configure the Cluster:&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;
&lt;P&gt;Go to your cluster's configuration page and click &lt;STRONG&gt;Edit&lt;/STRONG&gt;.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Go to &lt;STRONG&gt;Advanced Options&lt;/STRONG&gt; and click the &lt;STRONG&gt;Init Scripts&lt;/STRONG&gt; tab.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;In the "Destination" dropdown, select &lt;STRONG&gt;DBFS&lt;/STRONG&gt;.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Provide the path to your script: &lt;CODE&gt;dbfs:/databricks/init_scripts/install-my-module.sh&lt;/CODE&gt;.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Click &lt;STRONG&gt;Add&lt;/STRONG&gt;.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;&lt;STRONG&gt;Restart the Cluster:&lt;/STRONG&gt; You &lt;STRONG&gt;must restart the cluster&lt;/STRONG&gt; for the init script to take effect. On restart, this script will run on every node, guaranteeing your module is in the Python environment before &lt;CODE&gt;TorchDistributor&lt;/CODE&gt; tries to use it.&lt;/P&gt;
&lt;/LI&gt;
&lt;/OL&gt;
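&lt;P&gt;After the restart, it can help to confirm that the module is actually importable before launching a training run. The following is a minimal sketch, not part of the original steps: &lt;CODE&gt;probe_module&lt;/CODE&gt; is a hypothetical helper name, and &lt;CODE&gt;json&lt;/CODE&gt; is only a standard-library stand-in for your own top-level package name.&lt;/P&gt;

```python
import importlib.util
import socket
import sys


def probe_module(module_name):
    """Report whether a top-level module is importable in this Python process."""
    # find_spec returns None when a top-level module cannot be located.
    spec = importlib.util.find_spec(module_name)
    return {
        "host": socket.gethostname(),
        "python": sys.executable,
        "module": module_name,
        "found": spec is not None,
    }


if __name__ == "__main__":
    # "json" is a stand-in; replace it with the name of your own package.
    print(probe_module("json"))
```

&lt;P&gt;Run on the driver, this only checks the driver's environment. To check the workers, one option (assuming a notebook where the SparkContext is available as &lt;CODE&gt;sc&lt;/CODE&gt;) is something like &lt;CODE&gt;sc.parallelize(range(8), 8).map(lambda _: probe_module("my_module")).collect()&lt;/CODE&gt;. Spark does not promise that the tasks land on every worker, so compare the returned host names against your node list.&lt;/P&gt;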
&lt;H3&gt;Solution 2: Install as a Cluster Library (The "Intended" Way)&lt;/H3&gt;
&lt;P&gt;This is what you tried, but the key is to ensure it's done as part of the cluster's &lt;I&gt;permanent configuration&lt;/I&gt; and that the cluster is &lt;STRONG&gt;restarted&lt;/STRONG&gt; afterward. Installing a library via the API to an &lt;I&gt;already running&lt;/I&gt; cluster can be unreliable for worker propagation.&lt;/P&gt;
&lt;OL start="1"&gt;
&lt;LI&gt;
&lt;P&gt;&lt;STRONG&gt;Go to your cluster's configuration page.&lt;/STRONG&gt;&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Click the &lt;STRONG&gt;Libraries&lt;/STRONG&gt; tab.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Click &lt;STRONG&gt;Install New&lt;/STRONG&gt;.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;For "Library Source," select &lt;STRONG&gt;DBFS/S3&lt;/STRONG&gt; (or "Upload" if you want to upload it directly).&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Provide the full path to your &lt;CODE&gt;.whl&lt;/CODE&gt; file (e.g., &lt;CODE&gt;dbfs:/FileStore/my_libs/my_module-0.1.0-py3-none-any.whl&lt;/CODE&gt;).&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Click &lt;STRONG&gt;Install&lt;/STRONG&gt;.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;The UI will show the library as "Installing." Once the status changes to "Installed," &lt;STRONG&gt;restart the cluster&lt;/STRONG&gt;, whether or not the UI prompts you to, so that every worker installs the wheel at startup.&lt;/P&gt;
&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;This method &lt;I&gt;should&lt;/I&gt; instruct Databricks to handle the distribution and installation of the wheel to all worker nodes upon startup. If this method still fails, fall back to Solution 1, which is more explicit and bypasses any potential propagation delays.&lt;/P&gt;
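&lt;P&gt;If you would rather script the install-then-restart sequence than click through the UI, the two request bodies can be sketched as below. This is an illustration under assumptions: the helper names and the cluster ID are hypothetical placeholders, and the snippet only builds the JSON for &lt;CODE&gt;POST /api/2.0/libraries/install&lt;/CODE&gt; and &lt;CODE&gt;POST /api/2.0/clusters/restart&lt;/CODE&gt;; it does not send anything.&lt;/P&gt;

```python
import json


def build_install_request(cluster_id, wheel_path):
    """Build the JSON body for POST /api/2.0/libraries/install."""
    return {"cluster_id": cluster_id, "libraries": [{"whl": wheel_path}]}


def build_restart_request(cluster_id):
    """Build the JSON body for POST /api/2.0/clusters/restart."""
    return {"cluster_id": cluster_id}


if __name__ == "__main__":
    # Placeholder cluster ID (an assumption); replace it with your own.
    cluster_id = "0000-000000-example"
    wheel = "dbfs:/FileStore/my_libs/my_module-0.1.0-py3-none-any.whl"
    print(json.dumps(build_install_request(cluster_id, wheel), indent=2))
    print(json.dumps(build_restart_request(cluster_id), indent=2))
```

&lt;P&gt;Send the install body first, wait until &lt;CODE&gt;/api/2.0/libraries/cluster-status&lt;/CODE&gt; reports the library as installed, then send the restart body with your usual HTTP client and authentication.&lt;/P&gt;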
&lt;H3&gt;The Anti-Pattern: What Not to Do&lt;/H3&gt;
&lt;P&gt;Just for clarity, &lt;STRONG&gt;do not&lt;/STRONG&gt; use &lt;CODE&gt;%pip install&lt;/CODE&gt; in a notebook cell.&lt;/P&gt;
&lt;DIV class="code-block ng-tns-c2373308345-47 ng-animate-disabled ng-trigger ng-trigger-codeBlockRevealAnimation"&gt;
&lt;DIV class="formatted-code-block-internal-container ng-tns-c2373308345-47"&gt;
&lt;DIV class="animated-opacity ng-tns-c2373308345-47"&gt;
&lt;PRE class="ng-tns-c2373308345-47"&gt;&lt;CODE class="code-container formatted ng-tns-c2373308345-47" role="text" data-test-id="code-content"&gt;&lt;SPAN class="hljs-comment"&gt;# Do NOT do this for TorchDistributor&lt;/SPAN&gt;
%pip install dbfs:/FileStore/my_libs/my_module-&lt;SPAN class="hljs-number"&gt;0.1&lt;/SPAN&gt;&lt;SPAN class="hljs-number"&gt;.0&lt;/SPAN&gt;-py3-none-&lt;SPAN class="hljs-built_in"&gt;any&lt;/SPAN&gt;.whl
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;P&gt;A &lt;CODE&gt;%pip install&lt;/CODE&gt; is scoped to the current notebook session rather than to the cluster. The separate Python processes that &lt;CODE&gt;TorchDistributor&lt;/CODE&gt; launches on the workers are not guaranteed to see it, which leads to the exact &lt;CODE&gt;ModuleNotFoundError&lt;/CODE&gt; you are seeing.&lt;/P&gt;
&lt;P&gt;I recommend trying the &lt;STRONG&gt;Cluster Init Script&lt;/STRONG&gt; method first, as it's the most dependable solution for custom code in distributed workloads.&lt;/P&gt;</description>
      <pubDate>Thu, 06 Nov 2025 19:04:00 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/torchdistributor-installation-of-custom-python-package-via-wheel/m-p/138026#M4415</guid>
      <dc:creator>stbjelcevic</dc:creator>
      <dc:date>2025-11-06T19:04:00Z</dc:date>
    </item>
  </channel>
</rss>

