Databricks Community

KumarShiv · ‎08-01-2022

I am try to get the percentile values on different splits but I got that the result of Databricks PERCENTILE_DISC() function is not accurate . I have run the same query on MS SQL but getting different result set.

Here are both result sets for Pyspark and MS SQL

(1) Databricks Query and result set.

##================================

ringlist=[( "ring3", 1418),

("ring3", 8014),

("ring3", 4270)]

columns =("Ring","Value")

df1=spark.createDataFrame(ringlist, columns)

df1.createOrReplaceTempView("combineActionData")

df_Percentile = spark.sql("SELECT Ring\

,PERCENTILE_DISC(0.90) WITHIN GROUP (ORDER BY Value )\

OVER ( PARTITION BY Ring) AS Percentile90 \

,PERCENTILE_DISC(0.70) WITHIN GROUP (ORDER BY Value )\

OVER ( PARTITION BY Ring) AS Percentile70 \

,PERCENTILE_DISC(0.50) WITHIN GROUP (ORDER BY Value )\

OVER ( PARTITION BY Ring) AS Percentile50 \

,PERCENTILE_DISC(0.30) WITHIN GROUP (ORDER BY Value )\

OVER ( PARTITION BY Ring ) AS Percentile30 \

FROM combineActionData\

" ).distinct().display()

##================================

(2) MS SQL Result set

//======================

Create Table TestRing

(

Ring Varchar(50),

value int

)

INSERT INTO TestRing

Values( 'ring3', 1418),

('ring3', 8014),

('ring3', 4270)

SELECT Ring

,PERCENTILE_DISC(0.90) WITHIN GROUP (ORDER BY Value )OVER ( PARTITION BY Ring) AS Percentile90

,PERCENTILE_DISC(0.70) WITHIN GROUP (ORDER BY Value )OVER ( PARTITION BY Ring) AS Percentile70

,PERCENTILE_DISC(0.50) WITHIN GROUP (ORDER BY Value )OVER ( PARTITION BY Ring ) AS Percentile50

,PERCENTILE_DISC(0.30) WITHIN GROUP (ORDER BY Value )OVER ( PARTITION BY Ring) AS Percentile30

FROM TestRing

//=====================

Note: Output result set are attached in attachments.

Please anyone help me here, if any one have some idea.

artsheiko · ‎08-03-2022

The reason might be that in SQL PERCENTILE_DISC is nondeterministic

View solution in original post

artsheiko · ‎08-03-2022

The reason might be that in SQL PERCENTILE_DISC is nondeterministic

jose_gonzalez · ‎08-30-2022

Hi @Shiv Kumar,

Just a friendly follow-up. Did you have time to check the docs shared by Artsheiko? let us know if you still have any follow-up questions

Databricks Community

Databricks Spark SQL function "PERCENTILE_DISC()" output not accurate.

li.media.uploader-dialog.title

Join Us as a Local Community Builder!

Exciting Opportunity to Collaborate with Us!

Intelligent Data Warehousing: AI/BI for Self-service Analytics

Share Your Thoughts on Databricks & Get Rewarded!

Get Started With Lakehouse Architecture | Pass a quiz to earn your certificate completion.

Virtual Learning Festival: 9 April - 30 April