11-04-2021 01:09 PM
I am using a Databricks SQL notebook to run these queries.
I have a Python UDF like this:
%python
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType, DoubleType, DateType
def get_sell_price(sale_prices):
    # Return the first price in the collected list
    return sale_prices[0]
spark.udf.register("get_sell_price", get_sell_price, DoubleType())
It is called from a query like this:
SELECT
id,
get_sell_price(collect_list(sell_price))
FROM
table_name
GROUP BY
id
ORDER BY
date;
I want the sell prices inside the `collect_list` to be sorted by the specified column, but even though I specify the ordering in the query, the order is not maintained.
11-05-2021 06:04 AM
11-04-2021 01:45 PM
@John Constantine, the collect_list documentation says: "The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle." https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.collect_list.htm...
Generally, using collect_list in production is not the best solution. There are usually other ways to achieve what is needed.
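One possible way around the ordering problem, as a minimal sketch rather than a definitive fix: instead of relying on the order of the collected list, pass the ordering column into the UDF together with the price and pick the element inside the function. The sketch below assumes the list is built with `collect_list(struct(date, sell_price))`, that the ordering column is called `date` as in the question, and that prices are non-null. The registration and the query shown in the comments are hypothetical and are left commented out so the function can be checked without a Spark session.

```python
from datetime import date


def get_sell_price(dated_prices):
    """Return the price attached to the earliest date.

    `dated_prices` is assumed to be a list of (date, price) pairs. In a
    Spark UDF, each struct arrives as a Row, which behaves like a tuple,
    so positional indexing works the same way.
    """
    if not dated_prices:
        return None
    # Pick by date explicitly, so the order of the collected list is irrelevant.
    earliest = min(dated_prices, key=lambda pair: pair[0])
    return float(earliest[1])


# Hypothetical registration and call, mirroring the question:
#   spark.udf.register("get_sell_price", get_sell_price, DoubleType())
#   SELECT id, get_sell_price(collect_list(struct(date, sell_price)))
#   FROM table_name GROUP BY id

# Local check with plain tuples standing in for Spark rows.
shuffled = [(date(2021, 11, 3), 12.5), (date(2021, 11, 1), 10.0), (date(2021, 11, 2), 11.0)]
print(get_sell_price(shuffled))
```

If two rows share the same date, which of them is picked still depends on the collected order, so a tie-breaking column would be needed in that case. If no Python logic is required, the same result can probably be had in SQL alone, for example with `min_by(sell_price, date)` on Spark 3.0 or later, or by wrapping the list as `sort_array(collect_list(struct(date, sell_price)))`, which sorts the structs by their first field.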
10-14-2024 11:22 AM
I had a similar situation where I was trying to order the days of the week from Monday to Sunday. I saw solutions that use Python, but I wanted to do it all in SQL.
My original attempt was to use: