How to calculate a quantile on grouped data in a Spark DataFrame

dshosseinyousef — Thu, 22 Sep 2016 08:29:26 GMT

I have the following Spark DataFrame:

agent_id  payment_amount
a         1000
b         1100
a         1100
a         1200
b         1200
b         1250
a         10000
b         9000

My desired output would be something like:

<code>agent_id  95_quantile
  a         0.95 quantile of agent a's payments
  b         0.95 quantile of agent b's payments

For each agent_id group I need to calculate the 0.95 quantile. I tried the following approach:

<code>test_df.groupby('agent_id').approxQuantile('payment_amount',0.95)

but I get the following error:

<code>'GroupedData' object has no attribute 'approxQuantile'

I need the 0.95 quantile (percentile) in a new column so that it can be used later for filtering.

dshosseinyousef — Thu, 22 Sep 2016 08:30:51 GMT

@bill I'd appreciate your help, as this is very crucial.

Weiluo__David_R — Fri, 30 Dec 2016 18:17:54 GMT

For those of you who haven't run into this SO thread (http://stackoverflow.com/questions/39633614/calculate-quantile-on-grouped-data-in-spark-dataframe), it points out that one workaround is to use the Hive UDF "percentile_approx". Please see the accepted answer in that thread.
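
For reference, here is a minimal PySpark sketch of that workaround, not a definitive implementation. It assumes the column names from the question (agent_id, payment_amount) and that percentile_approx is available to Spark SQL in your setup; on older Spark versions this may require Hive support. The helper names quantile_expr and add_group_quantile are made up for illustration. The result is approximate, and it is joined back onto the original rows so the quantile ends up in a new column.

```python
def quantile_expr(value_col, q):
    """Build the SQL text for an approximate quantile of value_col."""
    return "percentile_approx({}, {})".format(value_col, q)


def add_group_quantile(df, group_col, value_col, q, out_col):
    """Return df with an extra column holding each group's approximate quantile.

    Hypothetical helper: assumes percentile_approx can be called from
    Spark SQL expressions in the running Spark session.
    """
    # Imported lazily so quantile_expr stays usable without a Spark install.
    from pyspark.sql import functions as F

    # One row per group: the approximate q-quantile of value_col.
    per_group = df.groupBy(group_col).agg(
        F.expr(quantile_expr(value_col, q)).alias(out_col)
    )
    # Attach each group's quantile to every original row of that group.
    return df.join(per_group, on=group_col, how="left")
```

With the DataFrame from the question, calling add_group_quantile(test_df, "agent_id", "payment_amount", 0.95, "quantile_95") should give a quantile_95 column that can then be used in a filter, for example keeping only rows where payment_amount is at most quantile_95.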