Data Engineering

Databricks Spark SQL function "PERCENTILE_DISC()" output not accurate.

KumarShiv
New Contributor III

I am trying to get percentile values on different splits, but the result of the Databricks PERCENTILE_DISC() function does not look accurate. I ran the same query on MS SQL and got a different result set.

Here are the queries and result sets for both PySpark and MS SQL.

(1) Databricks query and result set.

##================================
ringlist = [("ring3", 1418),
            ("ring3", 8014),
            ("ring3", 4270)]
columns = ("Ring", "Value")

df1 = spark.createDataFrame(ringlist, columns)
df1.createOrReplaceTempView("combineActionData")

df_Percentile = spark.sql("""
    SELECT Ring
        ,PERCENTILE_DISC(0.90) WITHIN GROUP (ORDER BY Value) OVER (PARTITION BY Ring) AS Percentile90
        ,PERCENTILE_DISC(0.70) WITHIN GROUP (ORDER BY Value) OVER (PARTITION BY Ring) AS Percentile70
        ,PERCENTILE_DISC(0.50) WITHIN GROUP (ORDER BY Value) OVER (PARTITION BY Ring) AS Percentile50
        ,PERCENTILE_DISC(0.30) WITHIN GROUP (ORDER BY Value) OVER (PARTITION BY Ring) AS Percentile30
    FROM combineActionData
""").distinct().display()
##================================

(2) MS SQL query and result set.

//======================
CREATE TABLE TestRing
(
    Ring VARCHAR(50),
    Value INT
)

INSERT INTO TestRing
VALUES ('ring3', 1418),
       ('ring3', 8014),
       ('ring3', 4270)

SELECT Ring
    ,PERCENTILE_DISC(0.90) WITHIN GROUP (ORDER BY Value) OVER (PARTITION BY Ring) AS Percentile90
    ,PERCENTILE_DISC(0.70) WITHIN GROUP (ORDER BY Value) OVER (PARTITION BY Ring) AS Percentile70
    ,PERCENTILE_DISC(0.50) WITHIN GROUP (ORDER BY Value) OVER (PARTITION BY Ring) AS Percentile50
    ,PERCENTILE_DISC(0.30) WITHIN GROUP (ORDER BY Value) OVER (PARTITION BY Ring) AS Percentile30
FROM TestRing
//=====================

Note: The output result sets are attached.

If anyone has an idea about this, please help.

1 ACCEPTED SOLUTION

2 REPLIES 2

artsheiko
Valued Contributor II

The reason might be that in SQL Server, PERCENTILE_DISC is documented as nondeterministic.
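As a concrete cross-check, the standard definition of PERCENTILE_DISC (the smallest value whose cumulative distribution is greater than or equal to the given fraction) can be applied to the sample data by hand. This is a minimal sketch in plain Python, not either engine's actual implementation, but it shows what a definition-conforming result would be:

```python
def percentile_disc(values, p):
    """Discrete percentile per the usual definition: return the smallest
    value whose cumulative distribution (rank / n) is >= p."""
    ordered = sorted(values)
    n = len(ordered)
    for rank, v in enumerate(ordered, start=1):
        if rank / n >= p:
            return v
    return ordered[-1]

# Sample data from the question
values = [1418, 8014, 4270]
for p in (0.90, 0.70, 0.50, 0.30):
    print(p, percentile_disc(values, p))
# -> 0.9 8014, 0.7 8014, 0.5 4270, 0.3 1418
```

With three rows the cumulative distributions are 1/3, 2/3, and 1, so for example at p = 0.70 the answer must be 8014 (2/3 ≈ 0.667 falls short of 0.70). Whichever engine disagrees with this hand computation is the one to investigate.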

Hi @Shiv Kumar,

Just a friendly follow-up: did you have time to check the docs shared by Artsheiko? Let us know if you still have any follow-up questions.
