cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Databricks Spark SQL function "PERCENTILE_DISC()" output not accurate.

KumarShiv
New Contributor III

I am try to get the percentile values on different splits but I got that the result of Databricks PERCENTILE_DISC() function is not accurate . I have run the same query on MS SQL but getting different result set.

Here are both result sets for Pyspark and MS SQL

(1) Databricks Query and result set.

##================================

ringlist=[( "ring3", 1418),

("ring3", 8014),

("ring3", 4270)]

columns =("Ring","Value")

df1=spark.createDataFrame(ringlist, columns)

df1.createOrReplaceTempView("combineActionData")

df_Percentile = spark.sql("SELECT Ring\

,PERCENTILE_DISC(0.90) WITHIN GROUP (ORDER BY Value )\

OVER ( PARTITION BY Ring) AS Percentile90 \

,PERCENTILE_DISC(0.70) WITHIN GROUP (ORDER BY Value )\

OVER ( PARTITION BY Ring) AS Percentile70 \

,PERCENTILE_DISC(0.50) WITHIN GROUP (ORDER BY Value )\

OVER ( PARTITION BY Ring) AS Percentile50 \

,PERCENTILE_DISC(0.30) WITHIN GROUP (ORDER BY Value )\

OVER ( PARTITION BY Ring ) AS Percentile30 \

FROM combineActionData\

" ).distinct().display()

##================================

(2) MS SQL Result set

//======================

Create Table TestRing

(

Ring Varchar(50),

value int

)

INSERT INTO TestRing

Values( 'ring3', 1418),

('ring3', 8014),

('ring3', 4270)

SELECT Ring

,PERCENTILE_DISC(0.90) WITHIN GROUP (ORDER BY Value )OVER ( PARTITION BY Ring) AS Percentile90

,PERCENTILE_DISC(0.70) WITHIN GROUP (ORDER BY Value )OVER ( PARTITION BY Ring) AS Percentile70

,PERCENTILE_DISC(0.50) WITHIN GROUP (ORDER BY Value )OVER ( PARTITION BY Ring ) AS Percentile50

,PERCENTILE_DISC(0.30) WITHIN GROUP (ORDER BY Value )OVER ( PARTITION BY Ring) AS Percentile30

FROM TestRing

//=====================

Note: Output result set are attached in attachments.

Please anyone help me here, if any one have some idea.

1 ACCEPTED SOLUTION

Accepted Solutions

artsheiko
Databricks Employee
Databricks Employee
2 REPLIES 2

artsheiko
Databricks Employee
Databricks Employee

The reason might be that in SQL PERCENTILE_DISC is nondeterministic

Hi @Shiv Kumar​,

Just a friendly follow-up. Did you have time to check the docs shared by Artsheiko? let us know if you still have any follow-up questions

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group