<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: ModuleNotFound error when using transformWithStateInPandas via a class defined outside the noteb in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/modulenotfound-error-when-using-transformwithstateinpandas-via-a/m-p/135703#M50405</link>
    <description>&lt;P&gt;This is no longer an issue. A patch version of Databricks Runtime 16.4 must have fixed it, and it now works without any changes to the original code.&lt;/P&gt;&lt;P&gt;Thanks.&lt;/P&gt;</description>
    <pubDate>Wed, 22 Oct 2025 13:43:11 GMT</pubDate>
    <dc:creator>VaDim</dc:creator>
    <dc:date>2025-10-22T13:43:11Z</dc:date>
    <item>
      <title>ModuleNotFound error when using transformWithStateInPandas via a class defined outside the notebook</title>
      <link>https://community.databricks.com/t5/data-engineering/modulenotfound-error-when-using-transformwithstateinpandas-via-a/m-p/123493#M47022</link>
      <description>&lt;P&gt;As per the Databricks documentation, when I define the class that extends `StatefulProcessor` in a notebook, everything works. However, execution fails with a ModuleNotFoundError as soon as the class definition is moved to a module of its own, in a .py file outside the notebook.&lt;/P&gt;&lt;P&gt;For example, say I have the class in `/Workspace/python/module1/processor.py`&lt;/P&gt;&lt;DIV&gt;&lt;PRE&gt;&lt;SPAN&gt;class &lt;/SPAN&gt;&lt;SPAN&gt;Processor&lt;/SPAN&gt;(StatefulProcessor):&lt;BR /&gt;    ...&lt;/PRE&gt;&lt;/DIV&gt;&lt;P&gt;and the notebook in `/Workspace/notebooks/notebook1.py`&lt;/P&gt;&lt;DIV&gt;&lt;PRE&gt;import os&lt;BR /&gt;import sys&lt;BR /&gt;&lt;BR /&gt;sys.path.append(os.path.abspath(&lt;SPAN&gt;"../python/"&lt;/SPAN&gt;))&lt;BR /&gt;&lt;BR /&gt;...&lt;BR /&gt;&lt;BR /&gt;from module1.processor import &lt;SPAN&gt;Processor&lt;BR /&gt;&lt;BR /&gt;df = df.groupBy("col1").transformWithStateInPandas(&lt;BR /&gt;    statefulProcessor=Processor(),&lt;BR /&gt;    outputStructType="...",&lt;BR /&gt;    outputMode="append",&lt;BR /&gt;    timeMode="ProcessingTime",&lt;BR /&gt;)&lt;BR /&gt;...&lt;/SPAN&gt;&lt;/PRE&gt;&lt;P&gt;&lt;SPAN&gt;On execution it fails with:&lt;/SPAN&gt;&lt;/P&gt;&lt;/DIV&gt;&lt;LI-CODE lang="markup"&gt;STREAMING_PYTHON_RUNNER_INITIALIZATION_FAILURE
...
    return cloudpickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'module1'&lt;/LI-CODE&gt;&lt;P&gt;Environment: Databricks Runtime 16.4&lt;/P&gt;&lt;P&gt;While searching for answers, I found this unanswered thread that sounds similar but concerns&amp;nbsp;&lt;SPAN&gt;&lt;A href="https://community.databricks.com/t5/data-engineering/module-not-found-when-using-applyinpandaswithstate-in-repos/td-p/53696" target="_self"&gt;applyInPandasWithState&lt;/A&gt;.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;I tried:&lt;/SPAN&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;SPAN&gt;different cluster access modes: standard, shared&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;SPAN&gt;pip-installing the Python files bundled as a wheel&lt;/SPAN&gt;&lt;/LI&gt;&lt;/UL&gt;</description>
      <pubDate>Tue, 01 Jul 2025 16:03:46 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/modulenotfound-error-when-using-transformwithstateinpandas-via-a/m-p/123493#M47022</guid>
      <dc:creator>VaDim</dc:creator>
      <dc:date>2025-07-01T16:03:46Z</dc:date>
    </item>
    <item>
      <title>Re: ModuleNotFound error when using transformWithStateInPandas via a class defined outside the noteb</title>
      <link>https://community.databricks.com/t5/data-engineering/modulenotfound-error-when-using-transformwithstateinpandas-via-a/m-p/135550#M50380</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/11189"&gt;@VaDim&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;
&lt;P&gt;Thanks for the detailed context — you’ve run into a common gotcha with how Python code is serialized and executed for stateful streaming on Databricks.&lt;/P&gt;
&lt;P&gt;Your &lt;FONT face="andale mono,times"&gt;sys.path.append&lt;/FONT&gt; only modifies the Python path on the driver node, but &lt;FONT face="andale mono,times"&gt;transformWithStateInPandas&lt;/FONT&gt; (like UDFs) executes its code on the worker nodes.&lt;/P&gt;
&lt;P&gt;When Spark serializes your &lt;FONT face="andale mono,times"&gt;Processor&lt;/FONT&gt; object to send to the workers, it uses &lt;FONT face="andale mono,times"&gt;cloudpickle&lt;/FONT&gt;. When the workers try to deserialize it, they fail with &lt;FONT face="andale mono,times"&gt;ModuleNotFoundError: No module named 'module1'&lt;/FONT&gt; because that Python file doesn't exist on their file system or in their &lt;FONT face="andale mono,times"&gt;PYTHONPATH&lt;/FONT&gt;.&lt;/P&gt;
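&lt;P&gt;As a minimal sketch of that failure mode outside Spark, the snippet below uses the standard-library pickle module as a stand-in for cloudpickle; both store a class from an importable module by reference, that is, by module name and class name. The temporary directory, the empty Processor class, and the "driver" and "worker" labels are assumptions for illustration only, and nothing here touches a cluster.&lt;/P&gt;

```python
import importlib
import pickle
import sys
import tempfile
from pathlib import Path

# Hypothetical stand-in for /Workspace/python/module1/processor.py
workdir = Path(tempfile.mkdtemp())
pkg = workdir / "module1"
pkg.mkdir()
(pkg / "__init__.py").write_text("")
(pkg / "processor.py").write_text("class Processor:\n    pass\n")

# "Driver": the directory is on sys.path, so the import succeeds.
sys.path.append(str(workdir))
importlib.invalidate_caches()
from module1.processor import Processor

# The payload records the module and class names, not the class body.
payload = pickle.dumps(Processor())

# "Worker": same payload, but the module cannot be imported here.
sys.path.remove(str(workdir))
for name in ("module1.processor", "module1"):
    sys.modules.pop(name, None)
importlib.invalidate_caches()

try:
    pickle.loads(payload)
    error = None
except ModuleNotFoundError as exc:
    error = str(exc)

print(error)
```

&lt;P&gt;This is presumably also why the notebook-defined class works: cloudpickle serializes classes defined in the interactive namespace by value, so the workers never need to import a module to rebuild them.&lt;/P&gt;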
&lt;P&gt;There are a couple of potential solutions here, one being slightly more involved than the other:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;STRONG&gt;Install the code as a wheel file (recommended best practice)&lt;/STRONG&gt;&lt;BR /&gt;I know you mentioned you tried this, but the way it's installed is important here. Running &lt;FONT face="andale mono,times" style="color: #1b3139;"&gt;%pip install&lt;/FONT&gt;&lt;SPAN&gt; in a notebook cell is not enough, as that often only installs on the driver or in the notebook's isolated environment. You must install your package as a &lt;/SPAN&gt;&lt;SPAN&gt;Cluster Library&lt;/SPAN&gt;&lt;SPAN&gt; or a &lt;/SPAN&gt;&lt;SPAN&gt;Job Library&lt;/SPAN&gt;&lt;SPAN&gt; so that it is distributed and installed on &lt;/SPAN&gt;&lt;SPAN&gt;all worker nodes&lt;/SPAN&gt;&lt;SPAN&gt; before your code runs.&lt;BR /&gt;&lt;BR /&gt;Steps: create a wheel file, upload it to DBFS or a UC Volume, then install it on your cluster. (&lt;A href="https://docs.databricks.com/aws/en/libraries/cluster-libraries#install-a-library-on-a-cluster" target="_self"&gt;source&lt;/A&gt;)&lt;BR /&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;&lt;STRONG&gt;Use&amp;nbsp;&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;FONT face="andale mono,times"&gt;spark.sparkContext.addPyFile()&lt;/FONT&gt;&lt;BR /&gt;This is a "lighter" solution if you don't want to build a full wheel file. This command tells Spark to ship your Python file to every worker.&lt;BR /&gt;&lt;BR /&gt;Make sure your module file is accessible, for example, by uploading it to DBFS or using a Workspace path. In your notebook, before you define the streaming query, add the file to the SparkContext. Note:&lt;SPAN&gt; You must use the full, absolute path. For Workspace files, prepend &lt;/SPAN&gt;&lt;FONT face="andale mono,times" style="color: #1b3139;"&gt;/Workspace/&lt;/FONT&gt;.&lt;BR /&gt;(&lt;A href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.SparkContext.addPyFile.html" target="_self"&gt;source 1&lt;/A&gt;) (&lt;A href="https://janetvn.medium.com/how-to-add-multiple-python-custom-modules-to-spark-job-6a8b943cdbbc" target="_self"&gt;source 2&lt;/A&gt;)&lt;/LI&gt;
&lt;/OL&gt;</description>
      <pubDate>Tue, 21 Oct 2025 16:59:14 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/modulenotfound-error-when-using-transformwithstateinpandas-via-a/m-p/135550#M50380</guid>
      <dc:creator>stbjelcevic</dc:creator>
      <dc:date>2025-10-21T16:59:14Z</dc:date>
    </item>
    <item>
      <title>Re: ModuleNotFound error when using transformWithStateInPandas via a class defined outside the noteb</title>
      <link>https://community.databricks.com/t5/data-engineering/modulenotfound-error-when-using-transformwithstateinpandas-via-a/m-p/135703#M50405</link>
      <description>&lt;P&gt;This is no longer an issue. A patch version of Databricks Runtime 16.4 must have fixed it, and it now works without any changes to the original code.&lt;/P&gt;&lt;P&gt;Thanks.&lt;/P&gt;</description>
      <pubDate>Wed, 22 Oct 2025 13:43:11 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/modulenotfound-error-when-using-transformwithstateinpandas-via-a/m-p/135703#M50405</guid>
      <dc:creator>VaDim</dc:creator>
      <dc:date>2025-10-22T13:43:11Z</dc:date>
    </item>
  </channel>
</rss>

