How to calculate a quantile on grouped data in a Spark DataFrame

dshosseinyousef
New Contributor II

I have the following Spark DataFrame:

agent_id   payment_amount
a          1000
b          1100
a          1100
a          1200
b          1200
b          1250
a          10000
b          9000

My desired output would be something like:

agent_id   95_quantile
a          the 0.95 quantile of agent a's payments
b          the 0.95 quantile of agent b's payments

For each agent_id group I need to calculate the 0.95 quantile, so I tried the following approach:

test_df.groupby('agent_id').approxQuantile('payment_amount', 0.95)

but I get the following error:

'GroupedData' object has no attribute 'approxQuantile'

I need to have the 0.95 quantile (percentile) in a new column so it can later be used for filtering purposes.
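
For reference, a minimal reproduction of the error, assuming Spark 2.x with an active SparkSession named spark and the sample data above (the relative error of 0.01 is just an illustrative value):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

test_df = spark.createDataFrame(
    [("a", 1000), ("b", 1100), ("a", 1100), ("a", 1200),
     ("b", 1200), ("b", 1250), ("a", 10000), ("b", 9000)],
    ["agent_id", "payment_amount"],
)

# approxQuantile lives on the DataFrame (DataFrameStatFunctions), not on
# the GroupedData object returned by groupby(), hence the AttributeError:
test_df.approxQuantile("payment_amount", [0.95], 0.01)                       # works on the whole DataFrame
test_df.groupby("agent_id").approxQuantile("payment_amount", [0.95], 0.01)   # raises AttributeError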

2 REPLIES

dshosseinyousef
New Contributor II

@bill I'd appreciate your help, as it is very crucial.

Weiluo__David_R
New Contributor II

For those of you who haven't run into this SO thread, http://stackoverflow.com/questions/39633614/calculate-quantile-on-grouped-data-in-spark-dataframe, it points out that one work-around is to use the Hive UDF "percentile_approx". Please see the accepted answer in that SO thread.
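
A rough sketch of that work-around in PySpark, assuming the test data from the question (the column name quantile_95 and the final join/filter step are illustrative additions, not from the original posts):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

test_df = spark.createDataFrame(
    [("a", 1000), ("b", 1100), ("a", 1100), ("a", 1200),
     ("b", 1200), ("b", 1250), ("a", 10000), ("b", 9000)],
    ["agent_id", "payment_amount"],
)

# percentile_approx is exposed through Spark SQL, so it can be used inside
# agg() via expr() even though GroupedData has no approxQuantile method.
quantiles = test_df.groupBy("agent_id").agg(
    F.expr("percentile_approx(payment_amount, 0.95)").alias("quantile_95")
)
quantiles.show()

# To get the per-group quantile next to every row for later filtering,
# join the aggregate back onto the original DataFrame:
filtered = (
    test_df.join(quantiles, on="agent_id")
           .where(F.col("payment_amount") <= F.col("quantile_95"))
)
filtered.show()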
