collect_set gives weird results when Photon is enabled

danny_edm — Sat, 20 Aug 2022 04:44:18 GMT

Cluster: DBR 10.4 LTS with Photon

Sample schema

seq_no (decimal)

type (string)

Sample data

seq_no  type
1       A
2       A
2       B

Command: F.size(F.collect_set(F.col("type")).over(Window.partitionBy("seq_no")))

The cluster with Photon yielded weird results, such as an array size > 2, while without Photon the results were still correct.

Currently I have to use this workaround: F.size(F.array_distinct(F.collect_list(F.col("type")).over(Window.partitionBy("seq_no"))))
