I am trying to get percentile values on different splits, but the result of the Databricks PERCENTILE_DISC() function does not look accurate. I ran the same query on MS SQL Server and got a different result set.
Here are both queries, for PySpark and MS SQL Server:
(1) Databricks query (result set attached):
##================================
# Sample data: one partition ("ring3") with three values
ringlist = [("ring3", 1418),
            ("ring3", 8014),
            ("ring3", 4270)]
columns = ("Ring", "Value")

df1 = spark.createDataFrame(ringlist, columns)
df1.createOrReplaceTempView("combineActionData")

df_Percentile = spark.sql("""
    SELECT Ring
        ,PERCENTILE_DISC(0.90) WITHIN GROUP (ORDER BY Value) OVER (PARTITION BY Ring) AS Percentile90
        ,PERCENTILE_DISC(0.70) WITHIN GROUP (ORDER BY Value) OVER (PARTITION BY Ring) AS Percentile70
        ,PERCENTILE_DISC(0.50) WITHIN GROUP (ORDER BY Value) OVER (PARTITION BY Ring) AS Percentile50
        ,PERCENTILE_DISC(0.30) WITHIN GROUP (ORDER BY Value) OVER (PARTITION BY Ring) AS Percentile30
    FROM combineActionData
""").distinct()
df_Percentile.display()
##================================
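To see which value each percentile should pick, it may help to list the cumulative distribution per row: per the SQL standard, PERCENTILE_DISC(p) returns the first value (in sort order) whose cumulative distribution is >= p. A minimal cross-check sketch, assuming the same temp view as above (CUME_DIST is a window function available in Spark SQL):
##================================
# List CUME_DIST per row so each PERCENTILE_DISC(p) result can be read off
# as the first Value whose CumeDist is >= p.
spark.sql("""
    SELECT Ring, Value,
           CUME_DIST() OVER (PARTITION BY Ring ORDER BY Value) AS CumeDist
    FROM combineActionData
""").display()
##================================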
(2) MS SQL Server query (result set attached):
//======================
CREATE TABLE TestRing
(
    Ring  VARCHAR(50),
    Value INT
);

INSERT INTO TestRing
VALUES ('ring3', 1418),
       ('ring3', 8014),
       ('ring3', 4270);

SELECT Ring
    ,PERCENTILE_DISC(0.90) WITHIN GROUP (ORDER BY Value) OVER (PARTITION BY Ring) AS Percentile90
    ,PERCENTILE_DISC(0.70) WITHIN GROUP (ORDER BY Value) OVER (PARTITION BY Ring) AS Percentile70
    ,PERCENTILE_DISC(0.50) WITHIN GROUP (ORDER BY Value) OVER (PARTITION BY Ring) AS Percentile50
    ,PERCENTILE_DISC(0.30) WITHIN GROUP (ORDER BY Value) OVER (PARTITION BY Ring) AS Percentile30
FROM TestRing;
//=====================
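For comparison, here is a small plain-Python sketch of that same standard definition (the smallest value whose cumulative distribution reaches p); the outputs from both engines can be checked against it. The percentile_disc helper name here is just for illustration, not part of either engine's API:
##================================
# Reference sketch of the standard PERCENTILE_DISC(p) definition:
# the smallest value v (in sort order) with (rows <= v) / (total rows) >= p.
# The helper name percentile_disc is hypothetical, for illustration only.
def percentile_disc(values, p):
    ordered = sorted(values)
    n = len(ordered)
    for i, v in enumerate(ordered, start=1):
        if i / n >= p:  # cumulative distribution of v
            return v

values = [1418, 8014, 4270]
for p in (0.90, 0.70, 0.50, 0.30):
    print(p, percentile_disc(values, p))
##================================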
Note: The output result sets are attached.
Could anyone who has an idea about what causes this difference please help?