Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Job aborted due to stage failure: ModuleNotFoundError

bd
New Contributor III

I'm getting this Failure Reason on a fairly simple streaming job. I'm running the job in a notebook. The notebook relies on a python module that I'm syncing to DBFS with `dbx`.

Within the notebook generally, the module is available, i.e. `import mymodule` works, after I've set the Python path with

```
import sys
sys.path.append('/dbfs/tmp/')
```

which is the location I'm syncing to. So far so good.

However, when I try to execute the cell with the streaming job, the job fails:

```
Job aborted due to stage failure: Task 4 in stage 56.0 failed 4 times, most recent failure: Lost task 4.3 in stage 56.0 (TID 1213) (ip-10-33-226-58.ec2.internal executor driver): org.apache.spark.api.python.PythonException: 'pyspark.serializers.SerializationError: Caused by Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/serializers.py", line 188, in _read_with_length
    return self.loads(obj)
  File "/databricks/spark/python/pyspark/serializers.py", line 540, in loads
    return cloudpickle.loads(obj, encoding=encoding)
  File "/databricks/spark/python/pyspark/cloudpickle/cloudpickle.py", line 679, in subimport
    __import__(name)
ModuleNotFoundError: mymodule
```

I would really like to understand what's happening here. I get that this is not necessarily an ideal or even a supported workflow, but it would be very useful to my understanding of the Databricks platform to get some insight into why the notebook itself is able to resolve the module, but the streaming job is not.

This is on a single-node personal cluster, fwiw.

1 ACCEPTED SOLUTION


Anonymous
Not applicable

Hi @Benjamin Dean

Hope all is well! Just wanted to check in whether you were able to resolve your issue. Would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!


3 REPLIES

Anonymous
Not applicable

@Benjamin Dean: Here are a few pointers to start thinking about this issue; you can test them out and implement whichever suits you best.

It looks like the issue here is that the module mymodule is not available in the Spark executors' Python environment. When a Spark task runs, it runs in a separate process and environment from the notebook, and that environment is determined by the cluster configuration. Therefore, setting the Python path in the notebook alone doesn't make the module available to the executors.

The error message you provided indicates that the Spark executor running the streaming job cannot find the mymodule Python module. When a task is shipped to an executor, cloudpickle serializes your function on the driver and re-imports its module dependencies inside the executor's Python worker process. That worker process has its own `sys.path`, which does not include the change you made in the notebook, so the import fails at deserialization time.
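To see why the notebook import succeeds while the task fails, here is a minimal stand-alone simulation of the driver/worker split, with no Spark involved. The temp directory stands in for `/dbfs/tmp/`; the child process stands in for the executor's Python worker, which never saw the driver's `sys.path` change:

```python
import os
import subprocess
import sys
import tempfile

# Create a module in a temp directory, standing in for /dbfs/tmp/mymodule.py.
workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "mymodule.py"), "w") as f:
    f.write("VALUE = 42\n")

# "Driver" (the notebook process): the sys.path tweak makes the import work.
sys.path.append(workdir)
import mymodule
print(mymodule.VALUE)  # -> 42

# "Executor worker": a fresh Python process that never ran the sys.path
# tweak, analogous to the worker that deserializes the streaming task.
result = subprocess.run(
    [sys.executable, "-c", "import mymodule"],
    capture_output=True, text=True,
)
print("ModuleNotFoundError" in result.stderr)  # -> True
```

The fix, accordingly, is to get the module file onto the workers' import path, not just the driver's.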

One way to make sure that the module is available on all the worker nodes is to include it in the Spark job's dependencies. If you submit the job yourself, you can do this with the `--py-files` option, which takes a comma-separated list of Python files to be distributed to the worker nodes. For example, if your module is in a file called mymodule.py located in /dbfs/tmp/, you can run the streaming job with:

```
spark-submit --py-files /dbfs/tmp/mymodule.py my_streaming_job.py
```

where my_streaming_job.py is the script that contains your streaming job. This ensures the module is available on every worker node where the streaming job executes.
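Since you are working entirely inside a notebook rather than with spark-submit, a closer equivalent is `SparkContext.addPyFile`, which ships the file to every executor and adds it to their import path. A minimal sketch, assuming the module lives at `/dbfs/tmp/mymodule.py` as in the question (this only runs on a Databricks cluster, where `spark` is the notebook's pre-created SparkSession):

```python
# Ship the module file to every executor before starting the streaming query.
# addPyFile puts the file on the executors' Python path, so cloudpickle's
# subimport can resolve it when tasks are deserialized.
spark.sparkContext.addPyFile("/dbfs/tmp/mymodule.py")

import mymodule  # resolvable on the driver as well

# ...define and start the streaming query as before...
```

Call `addPyFile` before starting the query, so the file is in place by the time the first task is deserialized on a worker.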

Please let us know if this helps, and whether you were able to get it working. Thanks!

bd
New Contributor III

So I'm still a bit confused. Perhaps because it seems like Databricks and Spark mean different things when they say "Job".

In my case, `spark-submit` is not any part of the process at all, unless it's being abstracted by the notebook. I'm aware of how to use `spark-submit` to start a ... "job". But what I'm trying to do is invoke some code in a library which results in a streaming query, entirely from within a notebook.

Is there any documentation that can clarify the relationship between the notebook and the Spark contexts? And between the Spark context and a streaming query?

