cancel
Showing results forย 
Search instead forย 
Did you mean:ย 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results forย 
Search instead forย 
Did you mean:ย 

collect_list by preserving order based on another variable - Spark SQL

Constantine
Contributor III

I am using databricks sql notebook to run these queries. 

I have a Python UDF like 

  

%python
 
 from pyspark.sql.functions import udf
 from pyspark.sql.types import StringType, DoubleType, DateType
 
 def get_sell_price(sale_prices):
     return sale_price[0] 
 
spark.udf.register("get_sell_price", get_sell_price, DoubleType()) 

This is running on a query like 

SELECT
  id,
  get_sell_price(sell_price)
FROM
  table_name
GROUP BY
  id
ORDER BY
  date;

I want the sell price inside the `collect_list` to be sorted based on the specified column, but even though I mention it in the query, it still doesn't maintain the order

1 ACCEPTED SOLUTION

Accepted Solutions

-werners-
Esteemed Contributor III

@John Constantineโ€‹, I think you want to use a window function for this.

View solution in original post

3 REPLIES 3

Hubert-Dudek
Esteemed Contributor III

@John Constantineโ€‹ , "The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle." https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.collect_list.htm...

Generally using collect_list in production is not the best solution. Usually, there are other ways to achieve what is needed.

-werners-
Esteemed Contributor III

@John Constantineโ€‹, I think you want to use a window function for this.

villi77
New Contributor II

I had a similar situation where I was trying to order the days of the week from Monday to Sunday.  I saw solutions that use Python but was wanting to do it all in SQL.  

My original attempt was to use: 

CONCAT_WS(',', COLLECT_LIST(DISTINCT t.LOAD_ORIG_DAY_BL))
 
But it did not order correctly and would not take an ORDER BY clause like the LIST_AGG in ORACLE.  Here is the replacement that works perfectly - The collect_list wrapped in a custom array_sort function...
     CONCAT_WS(',', ARRAY_SORT(COLLECT_LIST(DISTINCT t.LOAD_ORIG_DAY_BL),
                    (left, right) -> CASE
                                        WHEN left = 'MON' AND right IN ('TUE','WED','THU','FRI','SAT','SUN') THEN -1
                                        WHEN left = 'TUE' AND right IN ('WED','THU','FRI','SAT','SUN') THEN -1
                                        WHEN left = 'WED' AND right IN ('THU','FRI','SAT','SUN') THEN -1
                                        WHEN left = 'THU' AND right IN ('FRI','SAT','SUN') THEN -1
                                        WHEN left = 'FRI' AND right IN ('SAT','SUN') THEN -1
                                        WHEN left = 'SAT' AND right IN ('SUN') THEN -1                            
                                        ELSE 1
                                      END))

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you wonโ€™t want to miss the chance to attend and share knowledge.

If there isnโ€™t a group near you, start one and help create a community that brings people together.

Request a New Group