Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Pandas UDF of type grouped map fails

user_b22ce5eeAl
New Contributor II

Hello,

I am trying to compute the SHAP values for my whole dataset using a grouped-map pandas UDF, with one group per category of a categorical variable. It runs fine on a few categories, but when I run the function on the whole dataset the job fails. I see spills to both memory and disk, and my shuffle read is around 40 GB. I am not sure how to optimize my Spark job here: I increased the cores to 160 and the memory for both the driver and the workers, but still no success.

Any suggestion will be highly appreciated.

Thanks
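For context, a common shape for this kind of job (a minimal sketch; the post doesn't show the model or explainer, so `shap_per_category` below is a stand-in for the real SHAP call, and the column names are illustrative) is an ordinary per-group pandas function handed to `DataFrame.groupBy(...).applyInPandas(...)`. Each category's rows arrive as a single pandas DataFrame on one executor, which is why one very large category can drive the spills described above:

```python
import pandas as pd

def shap_per_category(pdf: pd.DataFrame) -> pd.DataFrame:
    """Stand-in for the real per-category SHAP computation.

    In the actual job this would call something like
    explainer.shap_values(pdf[feature_cols]); here it just emits a
    placeholder column so the pattern is runnable without shap installed.
    """
    pdf = pdf.copy()
    pdf["shap_value"] = pdf["feature"] * 0.0  # placeholder for the SHAP call
    return pdf

# In Spark the same function plugs in unchanged, e.g.:
#   df.groupBy("category").applyInPandas(
#       shap_per_category,
#       schema="category string, feature double, shap_value double")

# Local smoke test of the per-group function with plain pandas:
local = pd.DataFrame({"category": ["a", "a", "b"], "feature": [1.0, 2.0, 3.0]})
result = local.groupby("category", group_keys=False).apply(shap_per_category)
print(result["shap_value"].tolist())
```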

2 REPLIES 2

user_b22ce5eeAl
New Contributor II

Was able to get it done by increasing driver memory!
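For anyone landing here: on plain Spark, driver memory is set with the `spark.driver.memory` config key (the value below is illustrative, not what the poster used); on Databricks the driver's memory is usually governed by the driver node type chosen in the cluster settings rather than this key.

```
spark.driver.memory 32g
```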

Jackson
New Contributor II

I want to use data.groupby.apply() to apply a function per group of my PySpark DataFrame.

I used the grouped-map pandas UDFs. However, I can't figure out how to pass another argument to my function.

I tried making the argument a global variable, but the function doesn't recognize it (my argument is a PySpark DataFrame).
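One way to thread an extra argument into a grouped-map function (a sketch; `lookup`, `add_offset`, and the column names are illustrative) is to bake it in with `functools.partial` or a closure. Note that workers cannot reference a driver-side PySpark DataFrame at all, which is likely why the global-variable attempt fails; a small extra DataFrame should first be brought to the driver as pandas (e.g. `extra_sdf.toPandas()`), while a large one is better handled with a join before the `groupBy`:

```python
import functools
import pandas as pd

def add_offset(extra: pd.DataFrame, pdf: pd.DataFrame) -> pd.DataFrame:
    """Grouped-map function that takes the extra argument explicitly."""
    pdf = pdf.copy()
    offset = extra.set_index("category")["offset"]
    pdf["value"] = pdf["value"] + pdf["category"].map(offset)
    return pdf

# Illustrative extra argument, already converted to pandas on the driver:
lookup = pd.DataFrame({"category": ["a", "b"], "offset": [10.0, 20.0]})

# partial bakes the argument in, leaving the one-DataFrame signature
# that applyInPandas expects:
fn = functools.partial(add_offset, lookup)
# In Spark:  df.groupBy("category").applyInPandas(fn, schema=df.schema)

# Local check of the same function with plain pandas:
data = pd.DataFrame({"category": ["a", "b"], "value": [1.0, 2.0]})
result = data.groupby("category", group_keys=False).apply(fn)
print(result["value"].tolist())
```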
