Data Engineering

Pandas UDF of type grouped map fails

user_b22ce5eeAl
New Contributor II

Hello,

I am trying to compute SHAP values for my whole dataset using a pandas UDF, applied per category of a categorical variable. It runs fine on a few categories, but when I run the function on the whole dataset the job fails. I see spills to both memory and disk, and my shuffle read is around 40 GB. I am not sure how to optimize my Spark job here; I increased the cores to 160 and also the memory for both the driver and the workers, but it is still not successful.
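Roughly, the pattern is a grouped map over the categorical column. The snippet below is only a simplified, self-contained sketch (toy data, made-up column names, and a small model fitted per group), not my actual pipeline, but it shows the shape of the job:

```python
import pandas as pd
import shap
from xgboost import XGBRegressor
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy stand-in for the real dataset: "category" is the grouping key,
# f1/f2 are features and "label" is the target (all names are assumptions).
df = spark.createDataFrame(
    [("a", 1.0, 2.0, 0.5), ("a", 2.0, 1.0, 0.7),
     ("b", 3.0, 0.5, 0.1), ("b", 0.5, 3.0, 0.9)],
    ["category", "f1", "f2", "label"],
)

# One output row of SHAP values per input row, keyed by category.
result_schema = "category string, f1 double, f2 double"

def shap_per_category(pdf: pd.DataFrame) -> pd.DataFrame:
    # Each call receives all rows of one category as a pandas DataFrame;
    # fitting a small model here just keeps the sketch self-contained.
    X = pdf[["f1", "f2"]]
    model = XGBRegressor(n_estimators=20).fit(X, pdf["label"])
    values = shap.TreeExplainer(model).shap_values(X)
    out = pd.DataFrame(values, columns=["f1", "f2"])
    out.insert(0, "category", pdf["category"].iloc[0])
    return out

# Grouped map: every category's rows are shuffled onto a single task,
# which is why large or skewed categories can spill and exhaust memory.
shap_df = df.groupBy("category").applyInPandas(shap_per_category, schema=result_schema)
shap_df.show()
```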

Any suggestion will be highly appreciated.

Thanks

2 REPLIES

user_b22ce5eeAl
New Contributor II

Was able to get it done by increasing driver memory!
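
For anyone who hits the same thing: the fix was essentially just giving the driver more memory. A rough sketch of the relevant settings is below; the values are purely illustrative, and on Databricks the driver heap is normally governed by the driver node type and the cluster-level Spark config rather than set in notebook code.

```python
from pyspark.sql import SparkSession

# Illustrative values only. spark.driver.memory must be set before the driver
# JVM starts, so on Databricks it is effectively controlled by the driver node
# type and the cluster's Spark config (Advanced options > Spark).
spark = (
    SparkSession.builder
    .config("spark.driver.memory", "32g")        # bigger driver heap
    .config("spark.driver.maxResultSize", "8g")  # if large results come back to the driver
    .getOrCreate()
)
```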

Jackson
New Contributor II

I want to use data.groupby().apply() to apply a function to the rows of my PySpark DataFrame, per group.

I used the grouped map pandas UDFs. However, I can't figure out how to pass another argument to my function.

I tried using the argument as a global variable, but the function doesn't recognize it (my argument is a PySpark DataFrame).
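
To make it concrete, here is a minimal self-contained sketch (all names, columns, and toy data below are made up) of one way this is often handled: close over the extra argument with a small factory function after bringing it to pandas, since the grouped map function only receives each group's rows as a pandas DataFrame and can't reference a live PySpark DataFrame:

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy data standing in for the real DataFrames (names and columns are made up).
df = spark.createDataFrame(
    [("a", "k1", 1.0), ("a", "k2", 2.0), ("b", "k1", 3.0)],
    ["category", "key", "value"],
)
lookup_df = spark.createDataFrame([("k1", 10.0), ("k2", 20.0)], ["key", "extra"])

# The extra argument can't be a live PySpark DataFrame inside the UDF;
# collect it to pandas if it is small, or join it into df beforehand if not.
lookup_pdf = lookup_df.toPandas()

def make_udf(lookup: pd.DataFrame):
    # Returns a one-argument function that captures `lookup` in its closure,
    # which is the shape applyInPandas expects for a grouped map.
    def apply_with_lookup(pdf: pd.DataFrame) -> pd.DataFrame:
        # pdf holds one group's rows; merge in the captured lookup table.
        return pdf.merge(lookup, on="key", how="left")
    return apply_with_lookup

result = df.groupBy("category").applyInPandas(
    make_udf(lookup_pdf),
    schema="category string, key string, value double, extra double",
)
result.show()
```

If the extra DataFrame is large, joining it into the main DataFrame before the groupBy (or broadcasting the small side of the join) is usually the safer route than collecting it into the closure.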
