Data Engineering

Pandas UDF of type grouped map fails

user_b22ce5eeAl
New Contributor II

Hello,

I am trying to compute SHAP values for my whole dataset using a pandas UDF, applied per category of a categorical variable. It runs fine on a few categories, but when I run the function on the whole dataset the job fails. I see spills to both memory and disk, and my shuffle read is around 40 GB. I am not sure how to optimize my Spark job here; I increased the cores to 160 and also the memory for both the driver and the workers, but it is still not successful.
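Roughly, the pattern is a grouped map over the categorical column. The snippet below is only a simplified, self-contained sketch (toy data, made-up column names, and a small model fitted per group), not my actual pipeline, but it shows the shape of the job:

```python
import pandas as pd
import shap
from xgboost import XGBRegressor
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy stand-in for the real dataset: "category" is the grouping key,
# f1/f2 are features and "label" is the target (all names are assumptions).
df = spark.createDataFrame(
    [("a", 1.0, 2.0, 0.5), ("a", 2.0, 1.0, 0.7),
     ("b", 3.0, 0.5, 0.1), ("b", 0.5, 3.0, 0.9)],
    ["category", "f1", "f2", "label"],
)

# One output row of SHAP values per input row, keyed by category.
result_schema = "category string, f1 double, f2 double"

def shap_per_category(pdf: pd.DataFrame) -> pd.DataFrame:
    # Each call receives all rows of one category as a pandas DataFrame;
    # fitting a small model here just keeps the sketch self-contained.
    X = pdf[["f1", "f2"]]
    model = XGBRegressor(n_estimators=20).fit(X, pdf["label"])
    values = shap.TreeExplainer(model).shap_values(X)
    out = pd.DataFrame(values, columns=["f1", "f2"])
    out.insert(0, "category", pdf["category"].iloc[0])
    return out

# Grouped map: every category's rows are shuffled onto a single task,
# which is why large or skewed categories can spill and exhaust memory.
shap_df = df.groupBy("category").applyInPandas(shap_per_category, schema=result_schema)
shap_df.show()
```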

Any suggestion will be highly appreciated.

Thanks

2 REPLIES

user_b22ce5eeAl
New Contributor II

Was able to get it done by increasing driver memory!
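
For anyone who hits the same thing: the fix was essentially just giving the driver more memory. A rough sketch of the relevant settings is below; the values are purely illustrative, and on Databricks the driver heap is normally governed by the driver node type and the cluster-level Spark config rather than set in notebook code.

```python
from pyspark.sql import SparkSession

# Illustrative values only. spark.driver.memory must be set before the driver
# JVM starts, so on Databricks it is effectively controlled by the driver node
# type and the cluster's Spark config (Advanced options > Spark).
spark = (
    SparkSession.builder
    .config("spark.driver.memory", "32g")        # bigger driver heap
    .config("spark.driver.maxResultSize", "8g")  # if large results come back to the driver
    .getOrCreate()
)
```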

Jackson
New Contributor II

I want to use data.groupby().apply() to apply a function to the rows of my PySpark DataFrame, per group.

I used the grouped map pandas UDFs. However, I can't figure out how to pass another argument to my function.

I tried using the argument as a global variable, but the function doesn't recognize it (my argument is a PySpark DataFrame).
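
To make it concrete, here is a minimal self-contained sketch (all names, columns, and toy data below are made up) of one way this is often handled: close over the extra argument with a small factory function after bringing it to pandas, since the grouped map function only receives each group's rows as a pandas DataFrame and can't reference a live PySpark DataFrame:

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy data standing in for the real DataFrames (names and columns are made up).
df = spark.createDataFrame(
    [("a", "k1", 1.0), ("a", "k2", 2.0), ("b", "k1", 3.0)],
    ["category", "key", "value"],
)
lookup_df = spark.createDataFrame([("k1", 10.0), ("k2", 20.0)], ["key", "extra"])

# The extra argument can't be a live PySpark DataFrame inside the UDF;
# collect it to pandas if it is small, or join it into df beforehand if not.
lookup_pdf = lookup_df.toPandas()

def make_udf(lookup: pd.DataFrame):
    # Returns a one-argument function that captures `lookup` in its closure,
    # which is the shape applyInPandas expects for a grouped map.
    def apply_with_lookup(pdf: pd.DataFrame) -> pd.DataFrame:
        # pdf holds one group's rows; merge in the captured lookup table.
        return pdf.merge(lookup, on="key", how="left")
    return apply_with_lookup

result = df.groupBy("category").applyInPandas(
    make_udf(lookup_pdf),
    schema="category string, key string, value double, extra double",
)
result.show()
```

If the extra DataFrame is large, joining it into the main DataFrame before the groupBy (or broadcasting the small side of the join) is usually the safer route than collecting it into the closure.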
