
Sort within a groupBy with dataframe

LaurentThiebaud
New Contributor

Using a Spark DataFrame, e.g.:

myDf
  .filter(col("timestamp").gt(15000))
  .groupBy("groupingKey")
  .agg(collect_list("aDoubleValue"))

I want collect_list to return its result ordered by "timestamp", i.e. I want the values collected within each group to be sorted by another column.

I know there are other issues about it, but I couldn't find a reliable answer with DataFrame.

How can this be done? (The answer "sort myDf by "timestamp" before the groupBy" is not good.)

I already asked the question on Stack Overflow, see https://stackoverflow.com/questions/58239182/spark-sort-within-a-groupby-with-dataframe?noredirect=1... but I'd like not to use a temporary structure (because there are many fields that I use in the groupBy).

Thanks.

1 REPLY

shyam_9
Valued Contributor

Hi @Laurent Thiebaud,

Please use the following format to sort within a groupBy:

import org.apache.spark.sql.functions._ 
df.groupBy("columnA").agg(sort_array(collect_list("columnB")))
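Note that sort_array in the snippet above orders each list by the collected values themselves, not by a separate column such as "timestamp". One common way to order by another column is sketched below: collect (timestamp, value) structs, sort them, then extract the value field. This is only a sketch under assumptions: the sample rows, the local SparkSession, and the names SortWithinGroup, sorted and valuesByTimestamp are made up for illustration, and it does rely on an intermediate struct, which the question hoped to avoid.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, collect_list, sort_array, struct}

object SortWithinGroup {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption, not from the thread).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("sort-within-group")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data using the column names from the question.
    val myDf = Seq(
      ("a", 30000L, 3.0),
      ("a", 20000L, 9.0),
      ("b", 16000L, 1.0),
      ("a", 10000L, 5.0)
    ).toDF("groupingKey", "timestamp", "aDoubleValue")

    val result = myDf
      .filter(col("timestamp").gt(15000))
      .groupBy("groupingKey")
      // Structs are compared field by field, so placing timestamp first
      // makes sort_array order each group's entries by timestamp.
      .agg(
        sort_array(collect_list(struct(col("timestamp"), col("aDoubleValue"))))
          .as("sorted")
      )
      // Selecting a field from an array of structs yields an array of that field.
      .select(col("groupingKey"), col("sorted.aDoubleValue").as("valuesByTimestamp"))

    result.show(false)
    spark.stop()
  }
}
```

Because the ordering happens inside the aggregation, the result does not depend on how rows were ordered or partitioned before the groupBy, which is why sorting myDf beforehand is not a reliable substitute.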
