ModuleNotFound error when using transformWithStateInPandas via a class defined outside the notebook

VaDim

Following the Databricks documentation, everything works when I define the class that extends `StatefulProcessor` inside a notebook. However, execution fails with a `ModuleNotFoundError` as soon as the class definition is moved into a module of its own (a .py file outside the notebook).

For example, say I have the class in `/Workspace/python/module1/processor.py`:

from pyspark.sql.streaming import StatefulProcessor

class Processor(StatefulProcessor):
    ...

and the notebook in `/Workspace/notebooks/notebook1.py`

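import os
import sys

# Make /Workspace/python importable from the notebook (resolved relative to the current working directory)
sys.path.append(os.path.abspath("../python/"))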

...

from module1.processor import Processor

df = df.groupBy("col1").transformWithStateInPandas(
    statefulProcessor=Processor(),
    outputStructType="...",
    outputMode="append",
    timeMode="ProcessingTime",
)
...

On execution it fails with:

STREAMING_PYTHON_RUNNER_INITIALIZATION_FAILURE
...
    return cloudpickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'module1'
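One plausible reading of this traceback, offered as an assumption rather than a confirmed diagnosis: a class imported from a regular module is pickled by reference (module name plus class name), so the process that unpickles it must be able to import `module1` itself. A `sys.path.append` in the notebook only affects the driver's Python process. The sketch below reproduces the same kind of failure with the standard-library `pickle` module instead of cloudpickle and Spark; the temporary directory and the helper name `demo_missing_module` are hypothetical and exist only for illustration.

```python
import importlib
import os
import pickle
import sys
import tempfile


def demo_missing_module():
    """Pickle an instance of a class from an importable module, then try to
    unpickle it once the module is no longer importable (a stand-in for a
    worker process that never saw the notebook's sys.path change)."""
    with tempfile.TemporaryDirectory() as root:
        # Build a throwaway package: <root>/module1/processor.py
        pkg_dir = os.path.join(root, "module1")
        os.makedirs(pkg_dir)
        open(os.path.join(pkg_dir, "__init__.py"), "w").close()
        with open(os.path.join(pkg_dir, "processor.py"), "w") as f:
            f.write("class Processor:\n    pass\n")

        # "Driver" side: the directory is on sys.path, so the import works.
        sys.path.append(root)
        importlib.invalidate_caches()
        from module1.processor import Processor
        payload = pickle.dumps(Processor())

        # "Worker" side: same bytes, but module1 cannot be imported here.
        sys.path.remove(root)
        for name in [m for m in sys.modules
                     if m == "module1" or m.startswith("module1.")]:
            del sys.modules[name]
        try:
            pickle.loads(payload)
        except ModuleNotFoundError as exc:
            return payload, str(exc)
    return payload, None


payload, error = demo_missing_module()
print(error)
```

The payload carries only the name `module1.processor`, not the class body, which is why the load fails wherever that module cannot be imported. A class defined directly in a notebook lives in `__main__`, which cloudpickle serializes by value, and that would be consistent with the notebook-defined version working.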

Environment: Databricks Runtime 16.4

While searching for answers, I found this unanswered thread that sounds similar but relates to `applyInPandasWithState`.

I tried:

  • different cluster access modes: standard, shared
  • pip-installing the Python files bundled as a wheel