05-23-2022 03:33 AM
Hello guys,
I'm building a Python package that returns one row at a time from a DataFrame, inside a Databricks environment.
To improve the performance of this package I used Python's multiprocessing library.
I have a background process whose whole purpose is to prepare chunks of data (filter the big Spark DataFrame and convert it to pandas or a list using collect) and push them onto a multiprocessing queue for the main process.
Inside the sub-process I'm using the pyspark.sql.functions module to filter, index, and shuffle the big Spark DataFrame, convert it to pandas, and push it to the queue.
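Roughly, the pattern looks like the sketch below. This is a minimal illustration of the idea, not the exact code from my package: chunk_producer, the row_id indexing, the chunk_size default, and the example columns are all made up for the sake of the example.

```python
import multiprocessing as mp

import pyspark.sql.functions as F
from pyspark.sql import DataFrame, SparkSession, Window


def chunk_producer(df: DataFrame, out, chunk_size: int = 1_000) -> None:
    """Background process: shuffle and index the big Spark DF, then push
    pandas chunks onto the queue for the main process to consume."""
    indexed = df.withColumn(
        "row_id", F.row_number().over(Window.orderBy(F.rand(seed=42))) - 1
    ).cache()  # cache so each chunk filter sees the same shuffle order
    total = indexed.count()
    for start in range(0, total, chunk_size):
        chunk = (
            indexed.filter(
                (F.col("row_id") >= start) & (F.col("row_id") < start + chunk_size)
            )
            .drop("row_id")
            .toPandas()  # collect this slice to the driver as pandas
        )
        out.put(chunk)
    out.put(None)  # sentinel: no more chunks


if __name__ == "__main__":
    spark = SparkSession.builder.getOrCreate()
    big_df = spark.range(10_000).withColumn("value", F.rand())

    q = mp.Queue(maxsize=4)
    # Relies on the default "fork" start method on Linux: the child inherits
    # the parent's SparkSession instead of trying to pickle the DataFrame.
    producer = mp.Process(target=chunk_producer, args=(big_df, q))
    producer.start()

    # Main process: consume one chunk at a time, hand out one row at a time.
    while (chunk := q.get()) is not None:
        for _, row in chunk.iterrows():
            pass  # yield / process one row here
    producer.join()
```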
When I wrote all the objects inside a notebook, ran all the cells, and tested my object, everything went fine.
But after installing the wheel file I created (via pip) and running a function from it that uses my package, an error is thrown and I can't understand why.
From my point of view, for some reason the sub-process is running in an environment where pyspark.sql.functions is not available.
Attaching the error I get from the cluster stderr logs:
I hope you have an idea of how to overcome this error.
It would help a lot.
Thanks.
** If any information is missing, please let me know and I will edit the question. **
05-30-2022 02:18 AM
Still haven't managed to fix it. If someone knows how, it would be really helpful.
10-13-2022 12:01 AM
Hi @Orianh, have you managed to resolve it? I'm facing the same issue.
10-25-2022 10:05 AM
Hey @Vigneshwaran Ramanathan, nope.
After some tries and performance issues I just gave up on this approach 😅
I'm not sure how Databricks runs notebook cells, but I think the combination of Spark and multiprocessing causes this error, since Spark talks to a Java gateway (py4j) under the hood and the forked sub-process doesn't get a working connection to it.
06-27-2023 05:37 PM
Using threads instead of processes solved the issue for me.
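For reference, here is a minimal sketch of the thread-based variant (the names and the randomSplit chunking are illustrative, not code from the original package). Because the producer is a thread inside the driver process, it shares the same Python interpreter and py4j gateway as the main code, so pyspark.sql.functions keeps working even when the code is installed from a wheel.

```python
import queue
import threading

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
big_df = spark.range(10_000).withColumn("value", F.rand())

q = queue.Queue(maxsize=4)


def producer() -> None:
    # Runs in a background thread of the driver process, so the existing
    # SparkSession and pyspark.sql.functions are available with no extra setup.
    for part in big_df.randomSplit([0.25, 0.25, 0.25, 0.25], seed=42):
        q.put(part.toPandas())
    q.put(None)  # sentinel: no more chunks


threading.Thread(target=producer, daemon=True).start()

# Main thread: consume one chunk at a time, hand out one row at a time.
while (chunk := q.get()) is not None:
    for _, row in chunk.iterrows():
        pass  # process one row here
```

Most of the waiting happens inside the Spark job and the toPandas conversion, so the GIL wasn't much of a bottleneck for this kind of producer/consumer setup.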