
Sort within a groupBy with dataframe

LaurentThiebaud
New Contributor

Using a Spark DataFrame, e.g.:

myDf
  .filter(col("timestamp").gt(15000))
  .groupBy("groupingKey")
  .agg(collect_list("aDoubleValue"))

I want collect_list to return its result ordered by "timestamp", i.e. I want the values collected within each group to be sorted by another column.

I know there are other questions about this, but I couldn't find a reliable answer that uses DataFrames.

How can this be done? (Sorting myDf by "timestamp" before the groupBy is not a good answer.)

I already asked the question on Stack Overflow (see https://stackoverflow.com/questions/58239182/spark-sort-within-a-groupby-with-dataframe?noredirect=1...), but I'd prefer not to use a temporary structure, because I use many fields in the groupBy.

Thanks.

1 REPLY

shyam_9
Databricks Employee

Hi @Laurent Thiebaud,

Please use the format below to sort within a groupBy:

import org.apache.spark.sql.functions._ 
df.groupBy("columnA").agg(sort_array(collect_list("columnB")))
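Note that sort_array(collect_list("columnB")) orders the list by the collected values themselves, not by a separate column such as "timestamp". One possible way to get timestamp order, sketched below, is to collect (timestamp, value) structs, sort those, and then keep only the values. This is only a sketch: it does rely on an intermediate struct, the local SparkSession setup and the sample rows are assumptions made for illustration, and the column names are taken from the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Assumption: a local session for illustration; on Databricks `spark` already exists.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("sort-within-groupby")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data using the column names from the question.
val myDf = Seq(
  ("a", 30000L, 3.0),
  ("a", 20000L, 1.0),
  ("b", 25000L, 2.0),
  ("a", 10000L, 9.0) // removed by the timestamp filter below
).toDF("groupingKey", "timestamp", "aDoubleValue")

val result = myDf
  .filter(col("timestamp").gt(15000))
  .groupBy("groupingKey")
  // Collect (timestamp, value) pairs; sort_array compares structs field by field,
  // so putting "timestamp" first makes it the sort key.
  .agg(sort_array(collect_list(struct(col("timestamp"), col("aDoubleValue")))).as("pairs"))
  // Selecting a field of an array of structs yields an array of that field's values,
  // here the doubles in timestamp order.
  .select(col("groupingKey"), col("pairs.aDoubleValue").as("sortedValues"))

result.show(truncate = false)
```

With the sample rows above, group "a" should come out as [1.0, 3.0] (timestamp order) rather than in arrival order. Rows with equal timestamps fall back to comparing the value field.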
