<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Databricks Spark SQL function &amp;quot;PERCENTILE_DISC()&amp;quot; output not accurate. in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/databricks-spark-sql-function-quot-percentile-disc-quot-output/m-p/11810#M6729</link>
    <description>&lt;P&gt;I am try to get the percentile values on different splits but I got that the result of Databricks PERCENTILE_DISC()&amp;nbsp;function is not accurate . I have run the same query on MS SQL but getting different result set.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Here are both result sets for Pyspark and MS SQL&lt;/P&gt;&lt;P&gt;&lt;B&gt;(1) Databricks Query and result set.&lt;/B&gt;&lt;/P&gt;&lt;P&gt;##================================&lt;/P&gt;&lt;P&gt;ringlist=[(	"ring3",	1418),&lt;/P&gt;&lt;P&gt;("ring3",	8014),&lt;/P&gt;&lt;P&gt;("ring3",	4270)]&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;columns =("Ring","Value")&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;df1=spark.createDataFrame(ringlist, columns)&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;df1.createOrReplaceTempView("combineActionData")&lt;/P&gt;&lt;P&gt;df_Percentile = spark.sql("SELECT Ring\&lt;/P&gt;&lt;P&gt; ,PERCENTILE_DISC(0.90) WITHIN GROUP (ORDER BY Value )\&lt;/P&gt;&lt;P&gt; OVER ( PARTITION BY&amp;nbsp;Ring) AS Percentile90 \&lt;/P&gt;&lt;P&gt; ,PERCENTILE_DISC(0.70) WITHIN GROUP (ORDER BY Value )\&lt;/P&gt;&lt;P&gt; OVER ( PARTITION BY&amp;nbsp;Ring) AS Percentile70 \&lt;/P&gt;&lt;P&gt; ,PERCENTILE_DISC(0.50) WITHIN GROUP (ORDER BY Value )\&lt;/P&gt;&lt;P&gt; OVER ( PARTITION BY&amp;nbsp;Ring) AS Percentile50 \&lt;/P&gt;&lt;P&gt; ,PERCENTILE_DISC(0.30) WITHIN GROUP (ORDER BY Value )\&lt;/P&gt;&lt;P&gt; OVER ( PARTITION BY&amp;nbsp;Ring ) AS Percentile30 \&lt;/P&gt;&lt;P&gt; FROM combineActionData\&lt;/P&gt;&lt;P&gt; " ).distinct().display()&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;##================================&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;(2) MS SQL Result set&lt;/B&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;//======================&lt;/B&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;Create Table TestRing&lt;/B&gt;&lt;/P&gt;&lt;P&gt;(&lt;/P&gt;&lt;P&gt;Ring Varchar(50),&lt;/P&gt;&lt;P&gt;value int&lt;/P&gt;&lt;P&gt;)&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;INSERT INTO TestRing &lt;/P&gt;&lt;P&gt;Values(	'ring3',	1418),&lt;/P&gt;&lt;P&gt;('ring3',	8014),&lt;/P&gt;&lt;P&gt;('ring3',	4270)&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;SELECT Ring&lt;/P&gt;&lt;P&gt;,PERCENTILE_DISC(0.90) WITHIN GROUP (ORDER BY Value )OVER ( PARTITION BY Ring) AS Percentile90&lt;/P&gt;&lt;P&gt;,PERCENTILE_DISC(0.70) WITHIN GROUP (ORDER BY Value )OVER ( PARTITION BY&amp;nbsp;Ring) AS Percentile70&lt;/P&gt;&lt;P&gt;,PERCENTILE_DISC(0.50) WITHIN GROUP (ORDER BY Value )OVER ( PARTITION BY&amp;nbsp;Ring ) AS Percentile50&lt;/P&gt;&lt;P&gt;,PERCENTILE_DISC(0.30) WITHIN GROUP (ORDER BY Value )OVER ( PARTITION BY&amp;nbsp;Ring) AS Percentile30&lt;/P&gt;&lt;P&gt;FROM TestRing&lt;/P&gt;&lt;P&gt;&lt;B&gt;//=====================&lt;/B&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;Note: Output result set are attached in attachments.&lt;/B&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Please anyone help me here, if any one have some idea.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
    <pubDate>Mon, 01 Aug 2022 12:36:22 GMT</pubDate>
    <dc:creator>KumarShiv</dc:creator>
    <dc:date>2022-08-01T12:36:22Z</dc:date>
    <item>
      <title>Databricks Spark SQL function "PERCENTILE_DISC()" output not accurate.</title>
      <link>https://community.databricks.com/t5/data-engineering/databricks-spark-sql-function-quot-percentile-disc-quot-output/m-p/11810#M6729</link>
      <description>&lt;P&gt;I am try to get the percentile values on different splits but I got that the result of Databricks PERCENTILE_DISC()&amp;nbsp;function is not accurate . I have run the same query on MS SQL but getting different result set.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Here are both result sets for Pyspark and MS SQL&lt;/P&gt;&lt;P&gt;&lt;B&gt;(1) Databricks Query and result set.&lt;/B&gt;&lt;/P&gt;&lt;P&gt;##================================&lt;/P&gt;&lt;P&gt;ringlist=[(	"ring3",	1418),&lt;/P&gt;&lt;P&gt;("ring3",	8014),&lt;/P&gt;&lt;P&gt;("ring3",	4270)]&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;columns =("Ring","Value")&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;df1=spark.createDataFrame(ringlist, columns)&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;df1.createOrReplaceTempView("combineActionData")&lt;/P&gt;&lt;P&gt;df_Percentile = spark.sql("SELECT Ring\&lt;/P&gt;&lt;P&gt; ,PERCENTILE_DISC(0.90) WITHIN GROUP (ORDER BY Value )\&lt;/P&gt;&lt;P&gt; OVER ( PARTITION BY&amp;nbsp;Ring) AS Percentile90 \&lt;/P&gt;&lt;P&gt; ,PERCENTILE_DISC(0.70) WITHIN GROUP (ORDER BY Value )\&lt;/P&gt;&lt;P&gt; OVER ( PARTITION BY&amp;nbsp;Ring) AS Percentile70 \&lt;/P&gt;&lt;P&gt; ,PERCENTILE_DISC(0.50) WITHIN GROUP (ORDER BY Value )\&lt;/P&gt;&lt;P&gt; OVER ( PARTITION BY&amp;nbsp;Ring) AS Percentile50 \&lt;/P&gt;&lt;P&gt; ,PERCENTILE_DISC(0.30) WITHIN GROUP (ORDER BY Value )\&lt;/P&gt;&lt;P&gt; OVER ( PARTITION BY&amp;nbsp;Ring ) AS Percentile30 \&lt;/P&gt;&lt;P&gt; FROM combineActionData\&lt;/P&gt;&lt;P&gt; " ).distinct().display()&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;##================================&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;(2) MS SQL Result set&lt;/B&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;//======================&lt;/B&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;Create Table TestRing&lt;/B&gt;&lt;/P&gt;&lt;P&gt;(&lt;/P&gt;&lt;P&gt;Ring Varchar(50),&lt;/P&gt;&lt;P&gt;value int&lt;/P&gt;&lt;P&gt;)&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;INSERT INTO TestRing &lt;/P&gt;&lt;P&gt;Values(	'ring3',	1418),&lt;/P&gt;&lt;P&gt;('ring3',	8014),&lt;/P&gt;&lt;P&gt;('ring3',	4270)&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;SELECT Ring&lt;/P&gt;&lt;P&gt;,PERCENTILE_DISC(0.90) WITHIN GROUP (ORDER BY Value )OVER ( PARTITION BY Ring) AS Percentile90&lt;/P&gt;&lt;P&gt;,PERCENTILE_DISC(0.70) WITHIN GROUP (ORDER BY Value )OVER ( PARTITION BY&amp;nbsp;Ring) AS Percentile70&lt;/P&gt;&lt;P&gt;,PERCENTILE_DISC(0.50) WITHIN GROUP (ORDER BY Value )OVER ( PARTITION BY&amp;nbsp;Ring ) AS Percentile50&lt;/P&gt;&lt;P&gt;,PERCENTILE_DISC(0.30) WITHIN GROUP (ORDER BY Value )OVER ( PARTITION BY&amp;nbsp;Ring) AS Percentile30&lt;/P&gt;&lt;P&gt;FROM TestRing&lt;/P&gt;&lt;P&gt;&lt;B&gt;//=====================&lt;/B&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;Note: Output result set are attached in attachments.&lt;/B&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Please anyone help me here, if any one have some idea.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 01 Aug 2022 12:36:22 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricks-spark-sql-function-quot-percentile-disc-quot-output/m-p/11810#M6729</guid>
      <dc:creator>KumarShiv</dc:creator>
      <dc:date>2022-08-01T12:36:22Z</dc:date>
    </item>
    <item>
      <title>Re: Databricks Spark SQL function "PERCENTILE_DISC()" output not accurate.</title>
      <link>https://community.databricks.com/t5/data-engineering/databricks-spark-sql-function-quot-percentile-disc-quot-output/m-p/11811#M6730</link>
      <description>&lt;P&gt;The reason might be that in SQL &lt;A href="https://docs.microsoft.com/en-us/sql/t-sql/functions/percentile-disc-transact-sql?view=sql-server-ver16#general-remarks" alt="https://docs.microsoft.com/en-us/sql/t-sql/functions/percentile-disc-transact-sql?view=sql-server-ver16#general-remarks" target="_blank"&gt;PERCENTILE_DISC is nondeterministic&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 03 Aug 2022 18:17:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricks-spark-sql-function-quot-percentile-disc-quot-output/m-p/11811#M6730</guid>
      <dc:creator>artsheiko</dc:creator>
      <dc:date>2022-08-03T18:17:53Z</dc:date>
    </item>
    <item>
      <title>Re: Databricks Spark SQL function "PERCENTILE_DISC()" output not accurate.</title>
      <link>https://community.databricks.com/t5/data-engineering/databricks-spark-sql-function-quot-percentile-disc-quot-output/m-p/11812#M6731</link>
      <description>&lt;P&gt;Hi @Shiv Kumar​,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Just a friendly follow-up. Did you have time to check the docs shared by Artsheiko? let us know if you still have any follow-up questions&lt;/P&gt;</description>
      <pubDate>Tue, 30 Aug 2022 17:09:00 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricks-spark-sql-function-quot-percentile-disc-quot-output/m-p/11812#M6731</guid>
      <dc:creator>jose_gonzalez</dc:creator>
      <dc:date>2022-08-30T17:09:00Z</dc:date>
    </item>
  </channel>
</rss>

