Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Sort within a groupBy with dataframe

LaurentThiebaud
New Contributor

Using a Spark DataFrame, e.g.:

myDf
  .filter(col("timestamp").gt(15000))
  .groupBy("groupingKey")
  .agg(collect_list("aDoubleValue"))

I want collect_list to return its result ordered by "timestamp", i.e. I want the values within each group to be sorted by another column.

I know there are other issues about it, but I couldn't find a reliable answer with DataFrame.

How can this be done? (Sorting myDf by "timestamp" before the groupBy is not a good answer.)

I already asked the question on Stack Overflow, see https://stackoverflow.com/questions/58239182/spark-sort-within-a-groupby-with-dataframe?noredirect=1... but I'd prefer not to use a temporary structure (because I use many fields in the group-by).

Thanks.

1 REPLY

shyam_9
Databricks Employee

Hi @Laurent Thiebaud,

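Please use the format below to sort within a groupBy. Note that sort_array orders the collected values by their own natural order (ascending), not by another column:

import org.apache.spark.sql.functions._
df.groupBy("columnA").agg(sort_array(collect_list("columnB")))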
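
Because sort_array orders an array by its elements' own values, the snippet above sorts by columnB rather than by a separate column such as "timestamp". One possible way to get the ordering asked for in the question is to collect (timestamp, value) structs, sort those, and then pull out the value field. The following is only a sketch under assumptions: the object name, the helper valuesByTimestamp, the local SparkSession, and the toy rows are all hypothetical and not from the thread. It does use an intermediate struct, but only for the sort column and the value, so the grouping fields are unaffected.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

// Hypothetical helper object, for illustration only.
object SortWithinGroupBy {

  // Collects "aDoubleValue" per "groupingKey", ordered by "timestamp".
  def valuesByTimestamp(df: DataFrame): DataFrame =
    df.groupBy("groupingKey")
      // timestamp is the first struct field, so sort_array orders by it
      // (ties on timestamp fall back to aDoubleValue).
      .agg(sort_array(collect_list(struct(col("timestamp"), col("aDoubleValue")))).as("pairs"))
      // Selecting a field on an array of structs yields an array of that field.
      .select(col("groupingKey"), col("pairs.aDoubleValue").as("valuesByTimestamp"))

  def main(args: Array[String]): Unit = {
    // Local session and toy data are assumptions made for this sketch.
    val spark = SparkSession.builder().master("local[1]").appName("sortWithinGroupBy").getOrCreate()
    import spark.implicits._

    val myDf = Seq(
      ("a", 30000L, 1.0),
      ("a", 20000L, 5.0),
      ("b", 25000L, 9.0),
      ("a", 10000L, 7.0) // removed by the timestamp filter below
    ).toDF("groupingKey", "timestamp", "aDoubleValue")

    valuesByTimestamp(myDf.filter(col("timestamp").gt(15000))).show(false)
    spark.stop()
  }
}
```

Placing "timestamp" first in the struct is what drives the ordering; the final select drops it again so only the values remain in the result.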