<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Problem with df.first() or collect() when collation different from UTF8_BINARY in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/problem-with-df-first-or-collect-when-collation-different-from/m-p/114803#M44949</link>
    <description>&lt;P&gt;I'm getting a error when I want to select the first() or collect() from a dataframe when using a collation different than UTF8_BINARY&lt;/P&gt;&lt;P&gt;Example that reproduces the issue :&lt;/P&gt;&lt;P&gt;This works :&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;df_result&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;spark&lt;/SPAN&gt;&lt;SPAN&gt;.sql&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;f&lt;/SPAN&gt;&lt;SPAN&gt;"""&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;SELECT&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;'en-us'&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;AS&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;ETLLanguageCode&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;"""&lt;/SPAN&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;display&lt;/SPAN&gt;&lt;SPAN&gt;(df_result)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;print&lt;/SPAN&gt;&lt;SPAN&gt;(df_result.&lt;/SPAN&gt;&lt;SPAN&gt;collect&lt;/SPAN&gt;&lt;SPAN&gt;())&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;print&lt;/SPAN&gt;&lt;SPAN&gt;(df_result.&lt;/SPAN&gt;&lt;SPAN&gt;first&lt;/SPAN&gt;&lt;SPAN&gt;())&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;print&lt;/SPAN&gt;&lt;SPAN&gt;(df_result.&lt;/SPAN&gt;&lt;SPAN&gt;first&lt;/SPAN&gt;&lt;SPAN&gt;().&lt;/SPAN&gt;&lt;SPAN&gt;asDict&lt;/SPAN&gt;&lt;SPAN&gt;())&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;When I run this :&amp;nbsp;&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;df_result&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;spark&lt;/SPAN&gt;&lt;SPAN&gt;.sql&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;f&lt;/SPAN&gt;&lt;SPAN&gt;"""&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;SELECT&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;'en-us'&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;COLLATE&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;UTF8_LCASE&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;AS&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;ETLLanguageCode&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;"""&lt;/SPAN&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;display&lt;/SPAN&gt;&lt;SPAN&gt;(df_result)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;print&lt;/SPAN&gt;&lt;SPAN&gt;(df_result.&lt;/SPAN&gt;&lt;SPAN&gt;collect&lt;/SPAN&gt;&lt;SPAN&gt;())&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;print&lt;/SPAN&gt;&lt;SPAN&gt;(df_result.&lt;/SPAN&gt;&lt;SPAN&gt;first&lt;/SPAN&gt;&lt;SPAN&gt;())&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;print&lt;/SPAN&gt;&lt;SPAN&gt;(df_result.&lt;/SPAN&gt;&lt;SPAN&gt;first&lt;/SPAN&gt;&lt;SPAN&gt;().&lt;/SPAN&gt;&lt;SPAN&gt;asDict&lt;/SPAN&gt;&lt;SPAN&gt;())&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;I'm getting an error because the first() is empty, the count from the df says 1&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;What can I do to resolve this ? My tables are all UTF8_LCASE for the strings.&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;Settings :&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;DIV class=""&gt;&lt;SPAN class=""&gt;1-1 Worker&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV class=""&gt;&lt;SPAN class=""&gt;&lt;SPAN&gt;16-16&amp;nbsp;GB Memory&lt;/SPAN&gt;&lt;SPAN&gt;4-4&amp;nbsp;Cores&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV class=""&gt;&lt;SPAN class=""&gt;1 Driver&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV class=""&gt;&lt;SPAN class=""&gt;16&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;GB Memory,&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;4&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;Cores&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV class=""&gt;&lt;SPAN class=""&gt;Runtime&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV class=""&gt;&lt;SPAN class=""&gt;16.3.x-scala2.12&lt;/SPAN&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;Unity Catalog&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;Photon&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;Standard_D4ds_v5&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;</description>
    <pubDate>Tue, 08 Apr 2025 09:03:41 GMT</pubDate>
    <dc:creator>MDV</dc:creator>
    <dc:date>2025-04-08T09:03:41Z</dc:date>
    <item>
      <title>Problem with df.first() or collect() when collation different from UTF8_BINARY</title>
      <link>https://community.databricks.com/t5/data-engineering/problem-with-df-first-or-collect-when-collation-different-from/m-p/114803#M44949</link>
      <description>&lt;P&gt;I'm getting a error when I want to select the first() or collect() from a dataframe when using a collation different than UTF8_BINARY&lt;/P&gt;&lt;P&gt;Example that reproduces the issue :&lt;/P&gt;&lt;P&gt;This works :&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;df_result&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;spark&lt;/SPAN&gt;&lt;SPAN&gt;.sql&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;f&lt;/SPAN&gt;&lt;SPAN&gt;"""&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;SELECT&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;'en-us'&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;AS&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;ETLLanguageCode&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;"""&lt;/SPAN&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;display&lt;/SPAN&gt;&lt;SPAN&gt;(df_result)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;print&lt;/SPAN&gt;&lt;SPAN&gt;(df_result.&lt;/SPAN&gt;&lt;SPAN&gt;collect&lt;/SPAN&gt;&lt;SPAN&gt;())&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;print&lt;/SPAN&gt;&lt;SPAN&gt;(df_result.&lt;/SPAN&gt;&lt;SPAN&gt;first&lt;/SPAN&gt;&lt;SPAN&gt;())&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;print&lt;/SPAN&gt;&lt;SPAN&gt;(df_result.&lt;/SPAN&gt;&lt;SPAN&gt;first&lt;/SPAN&gt;&lt;SPAN&gt;().&lt;/SPAN&gt;&lt;SPAN&gt;asDict&lt;/SPAN&gt;&lt;SPAN&gt;())&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;When I run this :&amp;nbsp;&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;df_result&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;spark&lt;/SPAN&gt;&lt;SPAN&gt;.sql&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;f&lt;/SPAN&gt;&lt;SPAN&gt;"""&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;SELECT&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;'en-us'&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;COLLATE&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;UTF8_LCASE&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;AS&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;ETLLanguageCode&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;"""&lt;/SPAN&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;display&lt;/SPAN&gt;&lt;SPAN&gt;(df_result)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;print&lt;/SPAN&gt;&lt;SPAN&gt;(df_result.&lt;/SPAN&gt;&lt;SPAN&gt;collect&lt;/SPAN&gt;&lt;SPAN&gt;())&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;print&lt;/SPAN&gt;&lt;SPAN&gt;(df_result.&lt;/SPAN&gt;&lt;SPAN&gt;first&lt;/SPAN&gt;&lt;SPAN&gt;())&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;print&lt;/SPAN&gt;&lt;SPAN&gt;(df_result.&lt;/SPAN&gt;&lt;SPAN&gt;first&lt;/SPAN&gt;&lt;SPAN&gt;().&lt;/SPAN&gt;&lt;SPAN&gt;asDict&lt;/SPAN&gt;&lt;SPAN&gt;())&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;I'm getting an error because the first() is empty, the count from the df says 1&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;What can I do to resolve this ? My tables are all UTF8_LCASE for the strings.&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;Settings :&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;DIV class=""&gt;&lt;SPAN class=""&gt;1-1 Worker&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV class=""&gt;&lt;SPAN class=""&gt;&lt;SPAN&gt;16-16&amp;nbsp;GB Memory&lt;/SPAN&gt;&lt;SPAN&gt;4-4&amp;nbsp;Cores&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV class=""&gt;&lt;SPAN class=""&gt;1 Driver&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV class=""&gt;&lt;SPAN class=""&gt;16&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;GB Memory,&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;4&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;Cores&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV class=""&gt;&lt;SPAN class=""&gt;Runtime&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV class=""&gt;&lt;SPAN class=""&gt;16.3.x-scala2.12&lt;/SPAN&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;Unity Catalog&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;Photon&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;Standard_D4ds_v5&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;</description>
      <pubDate>Tue, 08 Apr 2025 09:03:41 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/problem-with-df-first-or-collect-when-collation-different-from/m-p/114803#M44949</guid>
      <dc:creator>MDV</dc:creator>
      <dc:date>2025-04-08T09:03:41Z</dc:date>
    </item>
    <item>
      <title>Re: Problem with df.first() or collect() when collation different from UTF8_BINARY</title>
      <link>https://community.databricks.com/t5/data-engineering/problem-with-df-first-or-collect-when-collation-different-from/m-p/114806#M44951</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/102804"&gt;@MDV&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I guess the issue likely comes from how non-default collations like UTF8_LCASE behave during serialization when using first() or collect(). As a workaround wrap the value in a subquery and re-cast the collation back to UTF8_BINARY before accessing it:&lt;/P&gt;&lt;P&gt;df_result = spark.sql("""&lt;BR /&gt;&amp;nbsp; &amp;nbsp; SELECT ETLLanguageCodes COLLATE UTF8_BINARY AS ETLLanguageCode&lt;BR /&gt;&amp;nbsp; &amp;nbsp; FROM (&lt;BR /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; SELECT 'en-us' COLLATE UTF8_LCASE AS ETLLanguageCodes&lt;BR /&gt;&amp;nbsp; &amp;nbsp; ) temp&lt;BR /&gt;""")&lt;BR /&gt;print(df_result.collect())&lt;/P&gt;&lt;P&gt;If this works, it likely confirms the collation is affecting serialization.&lt;/P&gt;</description>
      <pubDate>Tue, 08 Apr 2025 10:19:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/problem-with-df-first-or-collect-when-collation-different-from/m-p/114806#M44951</guid>
      <dc:creator>SP_6721</dc:creator>
      <dc:date>2025-04-08T10:19:35Z</dc:date>
    </item>
    <item>
      <title>Re: Problem with df.first() or collect() when collation different from UTF8_BINARY</title>
      <link>https://community.databricks.com/t5/data-engineering/problem-with-df-first-or-collect-when-collation-different-from/m-p/114821#M44958</link>
      <description>&lt;P&gt;That is what I'm doing now, but I can't image that it is the meaning to behave like this. I think this is a bug that needs to be fixed. This is just a mear example, it does this with dataframes coming from unity catalog where the collation is set on the table.&lt;/P&gt;</description>
      <pubDate>Tue, 08 Apr 2025 11:58:17 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/problem-with-df-first-or-collect-when-collation-different-from/m-p/114821#M44958</guid>
      <dc:creator>MDV</dc:creator>
      <dc:date>2025-04-08T11:58:17Z</dc:date>
    </item>
  </channel>
</rss>

