collect_list by preserving order based on another variable - Spark SQL

Constantine
Contributor III

I am using databricks sql notebook to run these queries. 

I have a Python UDF like 

  

%python
 
 from pyspark.sql.functions import udf
 from pyspark.sql.types import StringType, DoubleType, DateType
 
 def get_sell_price(sale_prices):
     return sale_price[0] 
 
spark.udf.register("get_sell_price", get_sell_price, DoubleType()) 

This is running on a query like 

SELECT
  id,
  get_sell_price(sell_price)
FROM
  table_name
GROUP BY
  id
ORDER BY
  date;

I want the sell price inside the `collect_list` to be sorted based on the specified column, but even though I mention it in the query, it still doesn't maintain the order

Hubert-Dudek
Databricks MVP

@John Constantine​ , "The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle." https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.collect_list.htm...

Generally using collect_list in production is not the best solution. Usually, there are other ways to achieve what is needed.


My blog: https://databrickster.medium.com/

-werners-
Esteemed Contributor III

@John Constantine​, I think you want to use a window function for this.

View solution in original post

villi77
New Contributor II

I had a similar situation where I was trying to order the days of the week from Monday to Sunday.  I saw solutions that use Python but was wanting to do it all in SQL.  

My original attempt was to use: 

CONCAT_WS(',', COLLECT_LIST(DISTINCT t.LOAD_ORIG_DAY_BL))
 
But it did not order correctly and would not take an ORDER BY clause like the LIST_AGG in ORACLE.  Here is the replacement that works perfectly - The collect_list wrapped in a custom array_sort function...
     CONCAT_WS(',', ARRAY_SORT(COLLECT_LIST(DISTINCT t.LOAD_ORIG_DAY_BL),
                    (left, right) -> CASE
                                        WHEN left = 'MON' AND right IN ('TUE','WED','THU','FRI','SAT','SUN') THEN -1
                                        WHEN left = 'TUE' AND right IN ('WED','THU','FRI','SAT','SUN') THEN -1
                                        WHEN left = 'WED' AND right IN ('THU','FRI','SAT','SUN') THEN -1
                                        WHEN left = 'THU' AND right IN ('FRI','SAT','SUN') THEN -1
                                        WHEN left = 'FRI' AND right IN ('SAT','SUN') THEN -1
                                        WHEN left = 'SAT' AND right IN ('SUN') THEN -1                            
                                        ELSE 1
                                      END))