Job aborted due to stage failure: ModuleNotFoundError

bd · ‎03-13-2023

I'm getting this Failure Reason on a fairly simple streaming job that I'm running in a notebook. The notebook relies on a Python module that I'm syncing to DBFS with `dbx`.

Within the notebook generally, the module is available, i.e. `import mymodule` works, after I've set the python path with

```
import sys
sys.path.append('/dbfs/tmp/')
```

which is the location I'm syncing to. So far so good.
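To make the "works in the notebook" check concrete, here is a minimal sketch of how the module's visibility can be inspected. The helper name `module_visibility` is hypothetical, not part of any Databricks or Spark API; `mymodule` and `/dbfs/tmp/` are the names from my setup. It only reports on the Python process it is called from, which is why it also returns the process ID.

```python
import importlib.util
import os
import sys


def module_visibility(name, extra_path=None):
    """Report whether top-level module `name` is importable in THIS process.

    `module_visibility` is a hypothetical helper for illustration only.
    """
    # Optionally extend the search path first, as in the notebook cell above.
    if extra_path and extra_path not in sys.path:
        sys.path.append(extra_path)

    # find_spec returns None when a top-level module cannot be located.
    spec = importlib.util.find_spec(name)
    return {
        "pid": os.getpid(),                       # which process answered
        "found": spec is not None,                # can `import name` succeed?
        "origin": getattr(spec, "origin", None),  # file it would load from
    }
```

Calling `module_visibility("mymodule", "/dbfs/tmp/")` in a notebook cell should report `found: True` if the sync worked. Its result says nothing about other Python processes on the cluster.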

However, when I try to execute the cell with the streaming job, the job fails:

```
Job aborted due to stage failure: Task 4 in stage 56.0 failed 4 times, most recent failure: Lost task 4.3 in stage 56.0 (TID 1213) (ip-10-33-226-58.ec2.internal executor driver): org.apache.spark.api.python.PythonException: 'pyspark.serializers.SerializationError: Caused by Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/serializers.py", line 188, in _read_with_length
    return self.loads(obj)
  File "/databricks/spark/python/pyspark/serializers.py", line 540, in loads
    return cloudpickle.loads(obj, encoding=encoding)
  File "/databricks/spark/python/pyspark/cloudpickle/cloudpickle.py", line 679, in subimport
    __import__(name)
ModuleNotFoundError: mymodule
```

I would really like to understand what's happening here. I get that this is not necessarily an ideal or even a supported workflow, but it would help my understanding of the Databricks platform to learn why the notebook itself can resolve the module while the streaming job cannot.

This is on a single-node personal cluster, fwiw.
