05-23-2022 03:33 AM
Hello guys,
I'm building a Python package that returns one row from a DataFrame at a time, inside a Databricks environment.
To improve the performance of this package I used the multiprocessing library in Python.
I have a background process whose whole purpose is to prepare chunks of data (filter the big Spark DataFrame and convert it to pandas or a list using collect) and push them onto a multiprocessing queue for the main process.
Inside the sub-process I'm using the pyspark.sql.functions module to filter, index and shuffle the big Spark DataFrame, convert it to pandas and push it to the queue.
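Roughly, the pattern looks like the sketch below. This is only a minimal illustration, not the actual package code; the chunk_id column and the function names are made up for the example. This is what works when I run the cells in a notebook but breaks once it's packaged as a wheel:

```python
# Minimal sketch of the pattern described above (illustrative names only,
# not the real package code). A producer process filters the big Spark
# DataFrame with pyspark.sql.functions, converts each chunk to pandas and
# pushes it onto a multiprocessing queue for the main process to consume.
import multiprocessing as mp
import pyspark.sql.functions as F

def producer(spark_df, chunk_ids, out_queue):
    for chunk_id in chunk_ids:
        # Filter one chunk out of the big Spark DataFrame and collect it
        # to the driver as a pandas DataFrame.
        chunk = spark_df.filter(F.col("chunk_id") == chunk_id).toPandas()
        out_queue.put(chunk)
    out_queue.put(None)  # sentinel: no more chunks

def iter_rows(spark_df, chunk_ids):
    out_queue = mp.Queue(maxsize=4)
    worker = mp.Process(target=producer, args=(spark_df, chunk_ids, out_queue))
    worker.start()
    while True:
        chunk = out_queue.get()
        if chunk is None:
            break
        for row in chunk.itertuples(index=False):
            yield row  # hand back one row at a time
    worker.join()
```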
When I wrote all the objects inside a notebook, ran all the cells and tested my object, everything went fine.
But after building a wheel file, installing the package I created with pip, and running a function from the wheel that uses my package, an error is thrown and I can't understand why.
From my point of view, for some reason the sub-process is running in an environment where it doesn't know pyspark.sql.functions.
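To illustrate what I mean, something like the hypothetical probe below (not part of my package) could be run in the same environment to check whether the child process can import pyspark.sql.functions at all and which interpreter it uses:

```python
# Hypothetical diagnostic, only to illustrate the suspicion above: check
# which interpreter the child process uses and whether pyspark imports there.
import multiprocessing as mp
import sys

def probe():
    print("child interpreter:", sys.executable)
    try:
        import pyspark.sql.functions as F  # noqa: F401
        print("pyspark.sql.functions imported OK in the child")
    except Exception as exc:
        print("import failed in the child:", exc)

if __name__ == "__main__":
    p = mp.Process(target=probe)
    p.start()
    p.join()
```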
Attaching the error I get from the cluster stderr logs:
I hope you guys have an idea of how to overcome this error.
It would help a lot.
Thanks.
** If any information is missing, please let me know and I will edit the question **
05-30-2022 02:18 AM
Still haven't managed to fix it. If someone knows how, it would be really helpful.
10-13-2022 12:01 AM
Hi @Orianh, have you managed to resolve it? I'm facing the same issue.
10-25-2022 10:05 AM
Hey @Vigneshwaran Ramanathan, nope.
After some tries and performance issues I just gave up on this approach 😅
I'm not sure how Databricks runs notebook cells, but I think the combination of Spark and multiprocessing causes this error: Spark uses Java under the hood, and the child process that multiprocessing starts doesn't get a working connection to the driver's JVM, so PySpark calls fail there.
06-27-2023 05:37 PM
Using threads instead of processes solved the issue for me.
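For reference, here is a minimal sketch of the thread-based variant (illustrative names, assuming the DataFrame has a chunk_id column to split on). Because a threading.Thread runs in the same process, it shares the driver's SparkSession and JVM gateway, so pyspark.sql.functions stays usable inside the worker:

```python
# Minimal sketch of a thread-based producer (illustrative names).
# A thread shares the driver process, so the SparkSession and its JVM
# gateway keep working inside the worker, unlike a separate process.
import threading
import queue
import pyspark.sql.functions as F

def producer(spark_df, chunk_ids, out_queue):
    for chunk_id in chunk_ids:
        chunk = spark_df.filter(F.col("chunk_id") == chunk_id).toPandas()
        out_queue.put(chunk)
    out_queue.put(None)  # sentinel: no more chunks

def iter_rows(spark_df, chunk_ids):
    out_queue = queue.Queue(maxsize=4)
    worker = threading.Thread(
        target=producer, args=(spark_df, chunk_ids, out_queue), daemon=True
    )
    worker.start()
    while True:
        chunk = out_queue.get()
        if chunk is None:
            break
        for row in chunk.itertuples(index=False):
            yield row  # hand back one row at a time
    worker.join()
```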