Sort within a groupBy with dataframe
10-07-2019 12:01 AM
Using a Spark DataFrame, e.g.
myDf
.filter(col("timestamp").gt(15000))
.groupBy("groupingKey")
.agg(collect_list("aDoubleValue"))
I want collect_list to return its result ordered according to "timestamp", i.e. I want the groupBy results to be sorted by another column.
I know there are other questions about this, but I couldn't find a reliable answer using the DataFrame API.
How can this be done? (The answer "sort myDf by "timestamp" before the groupBy" is not good.)
I already asked this question on Stack Overflow, see https://stackoverflow.com/questions/58239182/spark-sort-within-a-groupby-with-dataframe?noredirect=1... but I'd rather not use a temporary structure, because there are many fields that I use in the group-by.
Thanks.
10-07-2019 01:43 AM
Hi @Laurent Thiebaud,
Please use the format below to sort within a groupBy:
import org.apache.spark.sql.functions._
df.groupBy("columnA").agg(sort_array(collect_list("columnB")))
Note that sort_array orders the collected columnB values by their own natural ordering, not by another column.
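To sort the collected values by "timestamp" rather than by their own ordering, one common pattern is to collect (timestamp, value) structs and sort those: sort_array on an array of structs orders by the struct fields in declaration order, so putting "timestamp" first gives the desired ordering. A minimal sketch, assuming the column and key names from the original question (myDf, "groupingKey", "timestamp", "aDoubleValue") and Spark 2.x or later:

```scala
import org.apache.spark.sql.functions._

// Collect (timestamp, value) pairs per group, sort the array by
// timestamp (the first struct field), then keep only the values.
val result = myDf
  .filter(col("timestamp").gt(15000))
  .groupBy("groupingKey")
  .agg(
    sort_array(
      collect_list(struct(col("timestamp"), col("aDoubleValue")))
    ).as("sorted")
  )
  // Extracting a field from an array of structs yields an array of
  // that field, so this keeps the timestamp-sorted values only.
  .withColumn("aDoubleValues", col("sorted.aDoubleValue"))
  .drop("sorted")
```

This avoids relying on a pre-groupBy sort, whose ordering is not guaranteed to survive the aggregation.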

