Sort within a groupBy with dataframe
10-07-2019 12:01 AM
Using a Spark DataFrame, e.g.
myDf
.filter(col("timestamp").gt(15000))
.groupBy("groupingKey")
.agg(collect_list("aDoubleValue"))
I want collect_list to return its result ordered according to "timestamp", i.e. I want the groupBy results to be sorted by another column.
I know there are other questions about this, but I couldn't find a reliable answer using the DataFrame API.
How can this be done? (The answer "sort myDf by "timestamp" before the groupBy" is not good.)
I already asked this question on Stack Overflow, see https://stackoverflow.com/questions/58239182/spark-sort-within-a-groupby-with-dataframe?noredirect=1... but I'd rather not use a temporary structure, because there are many fields that I use in the group-by.
Thanks.
10-07-2019 01:43 AM
Hi @Laurent Thiebaud,
Please use the format below to sort within a groupBy:
import org.apache.spark.sql.functions._
df.groupBy("columnA").agg(sort_array(collect_list("columnB")))
Note that sort_array orders the collected columnB values by their own natural ordering, not by another column.
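To sort the collected values by "timestamp" rather than by their own ordering, one common pattern is to collect (timestamp, value) structs and sort those: sort_array on an array of structs orders by the struct fields in declaration order, so putting "timestamp" first gives the desired ordering. A minimal sketch, assuming the column and key names from the original question (myDf, "groupingKey", "timestamp", "aDoubleValue") and Spark 2.x or later:

```scala
import org.apache.spark.sql.functions._

// Collect (timestamp, value) pairs per group, sort the array by
// timestamp (the first struct field), then keep only the values.
val result = myDf
  .filter(col("timestamp").gt(15000))
  .groupBy("groupingKey")
  .agg(
    sort_array(
      collect_list(struct(col("timestamp"), col("aDoubleValue")))
    ).as("sorted")
  )
  // Extracting a field from an array of structs yields an array of
  // that field, so this keeps the timestamp-sorted values only.
  .withColumn("aDoubleValues", col("sorted.aDoubleValue"))
  .drop("sorted")
```

This avoids relying on a pre-groupBy sort, whose ordering is not guaranteed to survive the aggregation.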

