<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How to calculate Percentile of column in a DataFrame in spark? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-to-calculate-percentile-of-column-in-a-dataframe-in-spark/m-p/29666#M21377</link>
    <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;You can try registering the DataFrame as a temporary table and then continuing in SQL:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;df.registerTempTable("tmp_tbl")
val newDF = sqlContext.sql(/* do something with tmp_tbl */)
// ...and continue using newDF&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&lt;/P&gt;</description>
    <pubDate>Sat, 24 Sep 2016 08:56:39 GMT</pubDate>
    <dc:creator>amandaphy</dc:creator>
    <dc:date>2016-09-24T08:56:39Z</dc:date>
    <item>
      <title>How to calculate Percentile of column in a DataFrame in spark?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-calculate-percentile-of-column-in-a-dataframe-in-spark/m-p/29663#M21374</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;I am trying to calculate the percentile of a column in a DataFrame, but I can't find any percentile_approx function among Spark's aggregation functions. In Hive, for example, we have percentile_approx and can use it like this:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;hiveContext.sql("select percentile_approx(Open_Rate, 0.10) from myTable")&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;But I want to do it with the Spark DataFrame API for performance reasons.&lt;/P&gt;
&lt;P&gt;Sample data set:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;|User ID|Open_Rate|
|-------|---------|
|A1     |10.3     |
|B1     |4.04     |
|C1     |21.7     |
|D1     |18.6     |&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;I want to find out how many users fall into the 10th percentile, the 20th percentile, and so on. I want to do something like this:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;df.select($"id", Percentile($"Open_Rate")).show&lt;/CODE&gt;&lt;/PRE&gt;
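As a plain-Python illustration of what that hypothetical Percentile column would compute on the sample rows above (no Spark here; the decile rule — map each value's rank to a 1..10 bucket — is an illustrative assumption, not Spark's exact ntile semantics):

```python
# Percentile-rank bucketing of the sample Open_Rate values, sketched in
# plain Python (no Spark). The decile rule (rank -> 1..10 bucket) is an
# illustrative assumption, not a Spark API.
rates = {"A1": 10.3, "B1": 4.04, "C1": 21.7, "D1": 18.6}

ordered = sorted(rates, key=rates.get)  # users from lowest to highest rate
n = len(ordered)
decile = {u: i * 10 // n + 1 for i, u in enumerate(ordered)}
# lowest value (B1) lands in decile 1, highest (C1) in decile 8
```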
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 07 Jun 2016 23:27:41 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-calculate-percentile-of-column-in-a-dataframe-in-spark/m-p/29663#M21374</guid>
      <dc:creator>dheeraj</dc:creator>
      <dc:date>2016-06-07T23:27:41Z</dc:date>
    </item>
    <item>
      <title>Re: How to calculate Percentile of column in a DataFrame in spark?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-calculate-percentile-of-column-in-a-dataframe-in-spark/m-p/29664#M21375</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;A correction to the question above: I want to do something like this:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;df.select($"id", Percentile($"Open_Rate", 0.1)).show&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 07 Jun 2016 23:29:33 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-calculate-percentile-of-column-in-a-dataframe-in-spark/m-p/29664#M21375</guid>
      <dc:creator>dheeraj</dc:creator>
      <dc:date>2016-06-07T23:29:33Z</dc:date>
    </item>
    <item>
      <title>Re: How to calculate Percentile of column in a DataFrame in spark?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-calculate-percentile-of-column-in-a-dataframe-in-spark/m-p/29665#M21376</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;You could try coding your own version; this functionality does not seem to be built into Spark DataFrames. You may need the Window class to accomplish it. Here is a blog post with details: &lt;A href="https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html" target="_blank"&gt;https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html&lt;/A&gt;&lt;/P&gt; 
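As an aside, later Spark releases (2.0+) also expose df.stat.approxQuantile for this. The window-function route suggested here boils down to percent_rank over the ordered column, which can be sketched in plain Python on the question's sample values (no Spark session needed):

```python
# Rank arithmetic behind Spark's percent_rank window function:
# percent_rank = (rank - 1) / (rows - 1), sketched in plain Python
# on the sample Open_Rate values from the question.
rates = {"A1": 10.3, "B1": 4.04, "C1": 21.7, "D1": 18.6}

ordered = sorted(rates, key=rates.get)  # ascending by Open_Rate
n = len(ordered)
percent_rank = {u: i / (n - 1) for i, u in enumerate(ordered)}
# B1 -> 0.0 (lowest), C1 -> 1.0 (highest)
```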
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 09 Jun 2016 04:16:31 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-calculate-percentile-of-column-in-a-dataframe-in-spark/m-p/29665#M21376</guid>
      <dc:creator>SiddSingal</dc:creator>
      <dc:date>2016-06-09T04:16:31Z</dc:date>
    </item>
    <item>
      <title>Re: How to calculate Percentile of column in a DataFrame in spark?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-calculate-percentile-of-column-in-a-dataframe-in-spark/m-p/29666#M21377</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;You can try registering the DataFrame as a temporary table and then continuing in SQL:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;df.registerTempTable("tmp_tbl")
val newDF = sqlContext.sql(/* do something with tmp_tbl */)
// ...and continue using newDF&lt;/CODE&gt;&lt;/PRE&gt;
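The register-then-query pattern from this reply can be sketched in plain Python, with sqlite3 standing in for a SQLContext (Spark is not runnable here; the table name follows the reply, the column names and query are assumptions drawn from the thread's sample data):

```python
# Register-then-query pattern, with sqlite3 as a stand-in for a
# SQLContext. Table name mirrors the reply; columns/query are
# illustrative assumptions based on the thread's sample data.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tmp_tbl (user_id TEXT, open_rate REAL)")
conn.executemany(
    "INSERT INTO tmp_tbl VALUES (?, ?)",
    [("A1", 10.3), ("B1", 4.04), ("C1", 21.7), ("D1", 18.6)],
)

# "do something with tmp_tbl" in SQL, then keep working with the result
rows = conn.execute(
    "SELECT user_id FROM tmp_tbl ORDER BY open_rate DESC LIMIT 1"
).fetchall()
# rows == [("C1",)]  -- the user with the highest open rate
```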
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Sat, 24 Sep 2016 08:56:39 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-calculate-percentile-of-column-in-a-dataframe-in-spark/m-p/29666#M21377</guid>
      <dc:creator>amandaphy</dc:creator>
      <dc:date>2016-09-24T08:56:39Z</dc:date>
    </item>
  </channel>
</rss>

