Cluster: DBR 10.4 LTS with Photon
Sample schema
seq_no (decimal)
type (string)
Sample data
seq_no type
1 A
1 A
2 A
2 B
2 B
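For reference, the number of distinct type values per seq_no in this sample should be 1 for seq_no 1 and 2 for seq_no 2. A minimal plain-Python sketch (no Spark involved) that derives those expected values; the names SAMPLE_ROWS and distinct_types_per_seq are made up for illustration:

```python
from collections import defaultdict
from decimal import Decimal

# The sample rows from above as (seq_no, type) pairs.
SAMPLE_ROWS = [
    (Decimal(1), "A"),
    (Decimal(1), "A"),
    (Decimal(2), "A"),
    (Decimal(2), "B"),
    (Decimal(2), "B"),
]


def distinct_types_per_seq(rows):
    """Count the distinct type values seen for each seq_no."""
    seen = defaultdict(set)
    for seq_no, type_ in rows:
        seen[seq_no].add(type_)
    return {seq_no: len(types) for seq_no, types in seen.items()}


if __name__ == "__main__":
    for seq_no, count in sorted(distinct_types_per_seq(SAMPLE_ROWS).items()):
        print(seq_no, count)
```

Any size larger than these counts for a given seq_no means the set contained duplicates.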
Command: F.size(F.collect_set(F.col("type")).over(Window.partitionBy("seq_no")))
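On the cluster with Photon enabled, this returned weird results, such as array sizes greater than 2, even though no seq_no in the sample has more than two distinct type values. On a cluster without Photon, the results were correct.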
For now, I have to use a workaround: F.size(F.array_distinct(F.collect_list(F.col("type")).over(Window.partitionBy("seq_no"))))