
function does not exist in JVM ERROR

Orianh
Valued Contributor II

Hello guys,

I'm building a Python package that returns one row at a time from a Spark DataFrame inside a Databricks environment.

To improve the performance of this package, I used Python's multiprocessing library.

I have a background process whose sole purpose is to prepare chunks of data (filter the big Spark DataFrame and convert it to pandas, or to a list using collect) and push them to a multiprocessing queue for the main process.

Inside the sub-process I'm using the pyspark.sql.functions module to filter, index and shuffle the big Spark DataFrame, convert it to pandas and push it to the queue.

When I defined all the objects inside a notebook, ran all the cells and tested my object, everything went fine.

After installing a wheel file and the package I created via pip, I ran a function from the wheel file that uses my package, and an error was thrown that I can't understand.

From my point of view, for some reason the sub-process is running in an environment where it doesn't know pyspark.sql.functions.

Attaching the error I get from the cluster stderr logs:

function does not exist in JVM error.

I hope you have some idea of how to overcome this error.

It would help a lot.

Thanks.

** If any information is missing, please let me know and I will edit the question **

  • After more tries and tests, I'm able to run my object when installing the package from pip, but when I send my object to the Keras fit method, the sub-process can't find pyspark.sql.functions.
4 REPLIES

Orianh
Valued Contributor II

Still haven't managed. If someone knows how to fix it, it would be really helpful.

Vickyster
New Contributor II

Hi @Orianh, have you managed to resolve it? I'm facing the same issue.

Orianh
Valued Contributor II

Hey @Vigneshwaran Ramanathan, nope.

After some tries and performance issues I just gave up on this approach 😅

I'm not sure how Databricks runs notebook cells. I think the combination of Spark and multiprocessing causes this error, since Spark uses Java under the hood.

dineshreddy
New Contributor II

Using threads instead of processes solved the issue for me.
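
For anyone landing here, below is a minimal sketch of what a thread-based version of the chunk producer could look like. It is not the original poster's code: the names start_chunk_producer, iter_rows and fake_prepare_chunk are made up for illustration, and the Spark call in the comment (big_df, the chunk_id column) is an assumption about how the data might be split. The idea, as far as I understand it, is that a thread lives in the same Python process as the driver and so shares its Spark session and JVM connection, while a separate process does not.

```python
import queue
import threading

_SENTINEL = object()  # marks the end of the stream


def start_chunk_producer(prepare_chunk, num_chunks, maxsize=2):
    """Run prepare_chunk(i) for each chunk index on a background thread.

    Returns a bounded queue that the main thread can read chunks from.
    """
    out = queue.Queue(maxsize=maxsize)

    def worker():
        try:
            for i in range(num_chunks):
                out.put(prepare_chunk(i))  # blocks while the queue is full
        except Exception as exc:
            out.put(exc)  # hand the error over to the consumer
        finally:
            out.put(_SENTINEL)

    threading.Thread(target=worker, daemon=True).start()
    return out


def iter_rows(chunk_queue):
    """Yield one row at a time from the chunks arriving on the queue."""
    while True:
        chunk = chunk_queue.get()
        if chunk is _SENTINEL:
            return
        if isinstance(chunk, Exception):
            raise chunk
        for row in chunk:
            yield row


# Hypothetical Spark version of prepare_chunk (big_df and chunk_id are assumptions):
#   from pyspark.sql import functions as F
#   def prepare_chunk(i):
#       return big_df.filter(F.col("chunk_id") == i).toPandas().to_dict("records")


def fake_prepare_chunk(i):
    # Stand-in for the Spark filter + toPandas step, so the sketch runs anywhere.
    return [{"chunk": i, "row": j} for j in range(3)]


rows = list(iter_rows(start_chunk_producer(fake_prepare_chunk, num_chunks=2)))
```

The bounded queue keeps the background thread only a couple of chunks ahead of the consumer, so memory use stays limited while the main thread still rarely has to wait.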
