<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic collect_set weird result when Photon enabled in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/collect-set-wired-result-when-proton-enable/m-p/34164#M24942</link>
    <description>&lt;P&gt;Cluster: DBR 10.4 LTS with Photon&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Sample schema&lt;/P&gt;&lt;P&gt;seq_no (decimal)&lt;/P&gt;&lt;P&gt;type (string)&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Sample data&lt;/P&gt;&lt;P&gt;seq_no type&lt;/P&gt;&lt;P&gt;1             A&lt;/P&gt;&lt;P&gt;1             A&lt;/P&gt;&lt;P&gt;2            A&lt;/P&gt;&lt;P&gt;2            B&lt;/P&gt;&lt;P&gt;2            B&lt;/P&gt;&lt;P&gt;Command: F.size(F.collect_set(F.col("type")).over(Window.partitionBy("seq_no")))&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;With Photon enabled, the cluster returned weird results, such as an array size &amp;gt; 2; without Photon, the results were correct.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;For now, I have to use a workaround: F.size(F.array_distinct(F.collect_list()))&lt;/P&gt;</description>
    <pubDate>Sat, 20 Aug 2022 04:44:18 GMT</pubDate>
    <dc:creator>danny_edm</dc:creator>
    <dc:date>2022-08-20T04:44:18Z</dc:date>
    <item>
      <title>collect_set weird result when Photon enabled</title>
      <link>https://community.databricks.com/t5/data-engineering/collect-set-wired-result-when-proton-enable/m-p/34164#M24942</link>
      <description>&lt;P&gt;Cluster: DBR 10.4 LTS with Photon&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Sample schema&lt;/P&gt;&lt;P&gt;seq_no (decimal)&lt;/P&gt;&lt;P&gt;type (string)&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Sample data&lt;/P&gt;&lt;P&gt;seq_no type&lt;/P&gt;&lt;P&gt;1             A&lt;/P&gt;&lt;P&gt;1             A&lt;/P&gt;&lt;P&gt;2            A&lt;/P&gt;&lt;P&gt;2            B&lt;/P&gt;&lt;P&gt;2            B&lt;/P&gt;&lt;P&gt;Command: F.size(F.collect_set(F.col("type")).over(Window.partitionBy("seq_no")))&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;With Photon enabled, the cluster returned weird results, such as an array size &amp;gt; 2; without Photon, the results were correct.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;For now, I have to use a workaround: F.size(F.array_distinct(F.collect_list()))&lt;/P&gt;</description>
      <pubDate>Sat, 20 Aug 2022 04:44:18 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/collect-set-wired-result-when-proton-enable/m-p/34164#M24942</guid>
      <dc:creator>danny_edm</dc:creator>
      <dc:date>2022-08-20T04:44:18Z</dc:date>
    </item>
  </channel>
</rss>

