ModuleNotFound error when using transformWithStateInPandas via a class defined outside the notebook

VaDim
New Contributor III

Following the Databricks documentation, everything works when I define the class that extends `StatefulProcessor` in a notebook. However, execution fails with a `ModuleNotFoundError` as soon as the class definition is moved to its own module (a .py file) outside the notebook.

e.g.

Say I have the class in `/Workspace/python/module1/processor.py`

 

class Processor(StatefulProcessor):
...

and the notebook in `/Workspace/notebooks/notebook1.py`

import os
import sys

# The relative path is resolved against the current working directory
sys.path.append(os.path.abspath("../python/"))

...

from module1.processor import Processor

df = df.groupBy("col1").transformWithStateInPandas(
    statefulProcessor=Processor(),
    outputStructType="...",
    outputMode="append",
    timeMode="ProcessingTime",
)
...

On execution it fails with:

STREAMING_PYTHON_RUNNER_INITIALIZATION_FAILURE
...
    return cloudpickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'module1'
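
For context, the last frame of the traceback is a deserialization call, and the failure mode can be reproduced without Spark. The sketch below is only an illustration: it uses the standard-library `pickle` as a stand-in for `cloudpickle` (an assumption; both record classes from importable modules by module and class name rather than by source), and it simulates a "worker" by removing the directory from `sys.path` in the same process. The temporary directory and the empty `Processor` class are hypothetical stand-ins for the real layout.

```python
import importlib
import os
import pickle
import sys
import tempfile

# Build a throwaway package mirroring the layout in the post:
# <tmp>/module1/processor.py containing a class named Processor.
root = tempfile.mkdtemp()
pkg_dir = os.path.join(root, "module1")
os.makedirs(pkg_dir)
open(os.path.join(pkg_dir, "__init__.py"), "w").close()
with open(os.path.join(pkg_dir, "processor.py"), "w") as f:
    f.write("class Processor:\n    pass\n")

# "Driver" side: the directory is on sys.path, so import and pickling work.
sys.path.append(root)
importlib.invalidate_caches()
from module1.processor import Processor

payload = pickle.dumps(Processor())
# The payload holds the names needed to re-import the class, not its code.
assert b"module1.processor" in payload

# Simulated "worker" side: same bytes, but module1 is no longer importable.
sys.path.remove(root)
for name in [m for m in sys.modules if m == "module1" or m.startswith("module1.")]:
    del sys.modules[name]
importlib.invalidate_caches()

try:
    pickle.loads(payload)
    error_message = None
except ModuleNotFoundError as exc:
    # Unpickling has to import module1 again, and that import fails here.
    error_message = str(exc)

print(error_message)
```

If the same mechanism applies here, it would also be consistent with the notebook-defined class working: a class defined in the notebook itself does not need to be imported by name in the process that deserializes it.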

Environment: Databricks Runtime 16.4

While searching for answers, I found this unanswered thread that sounds similar but concerns `applyInPandasWithState`.

I tried:

  • different cluster access modes: standard, shared
  • pip-installing the Python files bundled as a wheel
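
When comparing attempts like the ones above, a small check of whether a module can be resolved by a given interpreter may help. This is a minimal sketch using only the standard library; `is_importable` is a hypothetical helper name, not part of any Databricks or PySpark API. Running it in a notebook cell only describes the process that executes the cell, and the Python processes that run the stateful processor may see a different `sys.path`.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if the current interpreter can locate the module."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # find_spec raises for dotted names whose parent package is missing,
        # and for modules already loaded without a usable spec.
        return False


# A standard-library module should be found; "module1" is found only if
# its parent directory (or an installed wheel) is visible to this process.
print(is_importable("json"))
print(is_importable("module1"))
```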