PySpark RDD fails with pytest
01-31-2023 07:25 AM
When I call RDD APIs during pytest, the module "serializer.py" apparently cannot find any other modules under pyspark.
I've already looked this up on the internet, and it seems the pyspark modules are not properly importing the modules they reference. Others appear to be running into a similar issue.
I tried packaging the whole Spark package into a zip file and loading it when creating the Spark session with the addPyFile() method, but unfortunately that didn't help.
Could anyone help me out with this?
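For reference, the addPyFile() workaround described above looks roughly like this (a sketch; the zip path and local session settings are assumptions, not the poster's exact code):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master('local[2]')
         .appName('pytest-repro')
         .getOrCreate())
# Ship the zipped package to the executors so worker processes can import it.
spark.sparkContext.addPyFile('/path/to/pyspark.zip')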
04-10-2023 07:09 AM
@hyunho lee : It sounds like you are encountering an issue where PySpark's serializer cannot find the modules it needs while testing with pytest. One solution you could try is to set the PYTHONPATH environment variable to include the path to your PySpark installation before running pytest, by adding the following lines to your test script:

import os
import sys
os.environ['PYTHONPATH'] = '/path/to/pyspark'  # inherited by Spark's Python worker processes
sys.path.insert(0, '/path/to/pyspark')         # lets the current test process import it too

Replace /path/to/pyspark with the actual path to your PySpark installation directory. Note that assigning to PYTHONPATH inside a running interpreter only affects child processes such as Spark's workers, which is why sys.path is updated for the test process itself.
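A pytest conftest.py is a convenient place to do this once for the whole test session. A minimal sketch, assuming a local SparkSession; the install path is a placeholder:

# conftest.py -- a sketch; adjust the path for your environment
import os
import sys

import pytest

PYSPARK_HOME = '/path/to/pyspark'  # placeholder, not a real path

# Make pyspark importable in the test process and in spawned workers.
sys.path.insert(0, PYSPARK_HOME)
os.environ['PYTHONPATH'] = PYSPARK_HOME

@pytest.fixture(scope='session')
def spark():
    from pyspark.sql import SparkSession
    session = (SparkSession.builder
               .master('local[2]')
               .appName('pytest-pyspark')
               .getOrCreate())
    yield session
    session.stop()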
Another solution you could try is the PYSPARK_PYTHON environment variable, which tells PySpark which Python executable to launch for its worker processes. Set it to the interpreter you used to install PySpark. For example:

import os
os.environ['PYSPARK_PYTHON'] = '/path/to/python'

Replace /path/to/python with the actual path to your Python executable.
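In a pytest run, pointing PYSPARK_PYTHON at sys.executable is a handy way to guarantee that the workers use exactly the interpreter (and installed packages) that is running the tests. A sketch:

import os
import sys

# Run Spark's workers (and the driver) with the same interpreter as pytest,
# so they resolve the same installed packages as the test process.
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable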
I hope this helps! Let me know if you have any further questions.

