As per the Databricks documentation, when I define the class that extends `StatefulProcessor` in a notebook, everything works fine. However, execution fails with a `ModuleNotFoundError` as soon as the class definition is moved to its own .py file (module) outside the notebook.
e.g.
Say I have the class in `/Workspace/python/module1/processor.py`
```python
from pyspark.sql.streaming import StatefulProcessor

class Processor(StatefulProcessor):
    ...
```
and the notebook in `/Workspace/notebooks/notebook1.py`
```python
import os
import sys

sys.path.append(os.path.abspath("../python/"))
...
from module1.processor import Processor

df = df.groupBy("col1").transformWithStateInPandas(
    statefulProcessor=Processor(),
    outputStructType="...",
    outputMode="append",
    timeMode="ProcessingTime",
)
...
```
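My understanding (an assumption on my part) of why the class's location matters: pickle serializes a class that lives in an importable module *by reference*, recording only the module and class names, so whoever deserializes it must be able to import that module. A minimal stdlib illustration using a standard-library class:

```python
import pickle
from collections import OrderedDict

# For a class defined in an importable module, pickle stores only a
# reference (module name + qualified name), not the class body itself.
payload = pickle.dumps(OrderedDict)

# Both lookup keys are embedded in the serialized bytes.
print(b"collections" in payload and b"OrderedDict" in payload)  # True
```

If the same logic applies to my `Processor`, the Spark workers would need `module1` on their `sys.path`, which my notebook-side `sys.path.append` presumably does not arrange.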
On execution it fails with:

```
STREAMING_PYTHON_RUNNER_INITIALIZATION_FAILURE
...
  return cloudpickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'module1'
```
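The error can be mimicked outside Spark, which supports my assumption about the mechanism: the worker's `cloudpickle.loads` receives a by-reference handle to `module1.Processor` and tries to import `module1`, which it cannot find. A hand-built pickle containing the same unresolvable reference reproduces it locally:

```python
import pickle

# A protocol-0 pickle whose GLOBAL opcode ("c") points at
# module1.Processor. Loading it forces an import of module1,
# just like cloudpickle.loads on the Spark worker.
payload = b"cmodule1\nProcessor\n."

try:
    pickle.loads(payload)
except ModuleNotFoundError as exc:
    print(exc)  # No module named 'module1'
```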
Environment: Databricks Runtime 16.4
While searching for answers, I found this unanswered thread that sounds similar, but it relates to `applyInPandasWithState`.
I tried:
- different cluster access modes: standard and shared
- `pip install`-ing the Python files bundled as a wheel