Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to calculate a quantile on grouped data in a Spark DataFrame

dshosseinyousef
New Contributor II

I have the following Spark DataFrame:

agent_id   payment_amount
a          1000
b          1100
a          1100
a          1200
b          1200
b          1250
a          10000
b          9000

My desired output would be something like:

<code>agent_id   95_quantile
  a          0.95 quantile of agent a's payments
  b          0.95 quantile of agent b's payments</code>

For each agent_id group I need to calculate the 0.95 quantile. I tried the following approach:

<code>test_df.groupby('agent_id').approxQuantile('payment_amount', 0.95)</code>

but I get the following error:

<code>'GroupedData' object has no attribute 'approxQuantile'</code>

I need the 0.95 quantile (95th percentile) in a new column so that it can be used later for filtering.

2 REPLIES

dshosseinyousef
New Contributor II

@bill I'd appreciate your help, as this is crucial for me.

Weiluo__David_R
New Contributor II

For those of you who haven't run into this SO thread, http://stackoverflow.com/questions/39633614/calculate-quantile-on-grouped-data-in-spark-dataframe, it points out that one workaround is to use the Hive UDF "percentile_approx". Please see the accepted answer in that thread.
