pandas udf type grouped map fails
08-13-2021 07:07 AM
Hello,
I am trying to compute SHAP values for my whole dataset using a grouped-map pandas UDF, with one group per category of a categorical variable. It runs fine on a few categories, but when I run the function on the whole dataset my job fails. I see spills to both memory and disk, and my shuffle read is around 40 GB. I am not sure how to optimize my Spark job here; I increased the cores to 160 and the memory for both the driver and the workers, but it still fails.
Any suggestions would be highly appreciated.
Thanks
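For context, a grouped-map pandas UDF receives one pandas DataFrame per group and must return a pandas DataFrame. The sketch below shows that shape with a stand-in computation (the SHAP call is indicated only in a comment, and the column names are hypothetical); it can be dry-run locally with plain pandas before wrapping it in `df.groupBy(...).applyInPandas(...)`.

```python
import pandas as pd

# Sketch of the function applyInPandas would call once per category.
# It takes one group's rows as a pandas DataFrame and returns one.
def explain_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # In the real job, a SHAP explainer would run here, e.g.:
    #   out[shap_cols] = explainer(pdf[feature_cols]).values
    # As a simple stand-in, center the feature within the group.
    out = pdf.copy()
    out["x_centered"] = out["x"] - out["x"].mean()
    return out

# Local dry run of the same per-group logic with pandas groupby:
df = pd.DataFrame({"category": ["a", "a", "b"], "x": [1.0, 3.0, 5.0]})
result = df.groupby("category", group_keys=False).apply(explain_group)
```

On Spark this would become `df.groupBy("category").applyInPandas(explain_group, schema=...)`; testing the function on a single pandas group first makes failures much easier to debug than inside the distributed job.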
- Labels:
- Pandas udf
- Shuffle
- Spill
08-16-2021 07:23 AM
I was able to get it done by increasing the driver memory!
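For reference, driver memory is set at cluster or session startup; the property names below are standard Spark settings, but the values are placeholders to adjust for your workload.

```
spark.driver.memory        32g
spark.driver.maxResultSize 8g
```

Note that `spark.driver.memory` cannot be changed on a running SparkSession; it must be set before the driver JVM starts (e.g., in the cluster's Spark config).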
08-16-2021 09:01 PM
I want to use data.groupby(...).apply() to apply a function to each group of my PySpark DataFrame.
I used the Grouped Map Pandas UDFs. However, I can't figure out how to pass another argument to my function.
I tried making the argument a global variable, but the function doesn't recognize it (my argument is a PySpark DataFrame).
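A grouped-map function only ever receives the group's pandas DataFrame, so extra arguments are usually bound with a closure or `functools.partial` before the function is handed to Spark. A PySpark DataFrame cannot be referenced inside the UDF at all: either convert it to pandas first (if it is small) or join it to the main DataFrame before grouping. A minimal sketch of the `partial` pattern, dry-run locally with pandas (the `lookup` table and column names are hypothetical):

```python
import functools
import pandas as pd

# The extra argument 'lookup' is a small pandas table, e.g. the
# result of small_spark_df.toPandas() collected on the driver.
def apply_with_lookup(pdf: pd.DataFrame, lookup: pd.DataFrame) -> pd.DataFrame:
    # Merge the extra data into this group's rows.
    return pdf.merge(lookup, on="category", how="left")

lookup = pd.DataFrame({"category": ["a", "b"], "weight": [0.1, 0.9]})

# Bind the extra argument; 'fn' now has the one-DataFrame signature
# that a grouped-map UDF requires.
fn = functools.partial(apply_with_lookup, lookup=lookup)

# Local stand-in for df.groupBy("category").applyInPandas(fn, schema=...):
df = pd.DataFrame({"category": ["a", "a", "b"], "x": [1, 2, 3]})
result = df.groupby("category", group_keys=False).apply(fn)
```

If the extra DataFrame is too large to collect with `.toPandas()`, a plain Spark join before the `groupBy` is the safer route.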

