05-23-2022 03:33 AM
Hello guys,
I'm building a Python package that returns one row at a time from a DataFrame, inside a Databricks environment.
To improve the performance of this package I used Python's multiprocessing library.
I have a background process whose whole purpose is to prepare chunks of data (filter the big Spark DataFrame and convert it to pandas or a list using collect) and push them onto a multiprocessing queue for the main process.
Inside the sub-process I'm using the pyspark.sql.functions module to filter, index, and shuffle the big Spark DataFrame, convert it to pandas, and push it to the queue.
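Roughly, the pattern looks like the sketch below. This is a minimal illustration of the idea, not the exact code from my package: chunk_producer, the row_id indexing, the chunk_size default, and the example columns are all made up for the sake of the example.

```python
import multiprocessing as mp

import pyspark.sql.functions as F
from pyspark.sql import DataFrame, SparkSession, Window


def chunk_producer(df: DataFrame, out, chunk_size: int = 1_000) -> None:
    """Background process: shuffle and index the big Spark DF, then push
    pandas chunks onto the queue for the main process to consume."""
    indexed = df.withColumn(
        "row_id", F.row_number().over(Window.orderBy(F.rand(seed=42))) - 1
    ).cache()  # cache so each chunk filter sees the same shuffle order
    total = indexed.count()
    for start in range(0, total, chunk_size):
        chunk = (
            indexed.filter(
                (F.col("row_id") >= start) & (F.col("row_id") < start + chunk_size)
            )
            .drop("row_id")
            .toPandas()  # collect this slice to the driver as pandas
        )
        out.put(chunk)
    out.put(None)  # sentinel: no more chunks


if __name__ == "__main__":
    spark = SparkSession.builder.getOrCreate()
    big_df = spark.range(10_000).withColumn("value", F.rand())

    q = mp.Queue(maxsize=4)
    # Relies on the default "fork" start method on Linux: the child inherits
    # the parent's SparkSession instead of trying to pickle the DataFrame.
    producer = mp.Process(target=chunk_producer, args=(big_df, q))
    producer.start()

    # Main process: consume one chunk at a time, hand out one row at a time.
    while (chunk := q.get()) is not None:
        for _, row in chunk.iterrows():
            pass  # yield / process one row here
    producer.join()
```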
When I wrote all the objects inside a notebook, ran all the cells, and tested my object, everything went fine.
But after installing the wheel file I created (via pip) and running a function from it that uses my package, an error is thrown and I can't understand why.
From my point of view, for some reason the sub-process is running in an environment where pyspark.sql.functions is not available.
Attaching the error I get from the cluster stderr logs:
I hope you have an idea of how to overcome this error.
It would help a lot.
Thanks.
** If any information is missing, please let me know and I will edit the question. **
05-30-2022 02:18 AM
Still haven't managed to fix it. If someone knows how, it would be really helpful.
10-13-2022 12:01 AM
Hi @Orianh, have you managed to resolve it? I'm facing the same issue.
10-25-2022 10:05 AM
Hey @Vigneshwaran Ramanathan, nope.
After some tries and performance issues I just gave up on this approach 😅
I'm not sure how Databricks runs notebook cells, but I think the combination of Spark and multiprocessing causes this error, since Spark talks to a Java gateway (py4j) under the hood and the forked sub-process doesn't get a working connection to it.
06-27-2023 05:37 PM
Using threads instead of processes solved the issue for me.
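For reference, here is a minimal sketch of the thread-based variant (the names and the randomSplit chunking are illustrative, not code from the original package). Because the producer is a thread inside the driver process, it shares the same Python interpreter and py4j gateway as the main code, so pyspark.sql.functions keeps working even when the code is installed from a wheel.

```python
import queue
import threading

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
big_df = spark.range(10_000).withColumn("value", F.rand())

q = queue.Queue(maxsize=4)


def producer() -> None:
    # Runs in a background thread of the driver process, so the existing
    # SparkSession and pyspark.sql.functions are available with no extra setup.
    for part in big_df.randomSplit([0.25, 0.25, 0.25, 0.25], seed=42):
        q.put(part.toPandas())
    q.put(None)  # sentinel: no more chunks


threading.Thread(target=producer, daemon=True).start()

# Main thread: consume one chunk at a time, hand out one row at a time.
while (chunk := q.get()) is not None:
    for _, row in chunk.iterrows():
        pass  # process one row here
```

Most of the waiting happens inside the Spark job and the toPandas conversion, so the GIL wasn't much of a bottleneck for this kind of producer/consumer setup.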