Hubert-Dudek
Databricks MVP

@John Constantine, as the documentation puts it: "The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle." https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.collect_list.htm...

Generally, using collect_list in production is not the best solution. Usually there are other ways to achieve what is needed.


My blog: https://databrickster.medium.com/