I am trying to calculate a percentile of a column in a DataFrame, but I can't find any percentile_approx function among the Spark aggregation functions. In Hive, for example, we have percentile_approx and can use it in the following way:
hiveContext.sql("select percentile_approx(Open_Rate, 0.10) from myTable")
But I want to do it with the Spark DataFrame API for performance reasons.
Sample data set:
|User ID|Open_Rate|
-------------------
|A1     |10.3     |
|B1     |4.04     |
|C1     |21.7     |
|D1     |18.6     |
I want to find out how many users fall into the 10th percentile, the 20th percentile, and so on. I want to do something like this:
df.select($"id",Percentile($"Open_Rate")).show
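To make the expected result concrete, here is a rough sketch in plain Scala (no Spark involved) of what I mean, run on the sample rows above. The names PercentileSketch, percentile and usersAtOrBelow are made up for illustration, and the nearest-rank definition is only an assumption on my part; Hive's percentile_approx may well return a slightly different value.

```scala
// Plain-Scala illustration of the desired result; this is NOT a Spark API.
object PercentileSketch {
  // Sample rows from the table above: (User ID, Open_Rate).
  val rows: Seq[(String, Double)] = Seq(
    ("A1", 10.3), ("B1", 4.04), ("C1", 21.7), ("D1", 18.6)
  )

  // Nearest-rank percentile (assumed definition, for illustration only):
  // the smallest value such that at least a fraction p of the values are <= it.
  def percentile(values: Seq[Double], p: Double): Double = {
    require(values.nonEmpty && p > 0.0 && p <= 1.0, "need data and 0 < p <= 1")
    val sorted = values.sorted
    val rank = math.ceil(p * sorted.size).toInt // 1-based rank
    sorted(rank - 1)
  }

  // IDs of the users whose Open_Rate is at or below the p-th percentile.
  def usersAtOrBelow(p: Double): Seq[String] = {
    val cutoff = percentile(rows.map(_._2), p)
    rows.collect { case (id, rate) if rate <= cutoff => id }
  }
}
```

So for the 10th percentile I would call PercentileSketch.usersAtOrBelow(0.10), and count the returned IDs. What I am looking for is the equivalent done through DataFrame operations rather than by collecting the column to the driver like this.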