Problem with df.first() or collect() when the collation is different from UTF8_BINARY
2 weeks ago
I'm getting an error when I call first() or collect() on a DataFrame that uses a collation other than UTF8_BINARY.
Example that reproduces the issue:
This works:
2 weeks ago - last edited 2 weeks ago
Hi @MDV
I suspect the issue comes from how non-default collations such as UTF8_LCASE are handled during serialization when first() or collect() is called. As a workaround, wrap the value in a subquery and re-cast the column to UTF8_BINARY before accessing it:
df_result = spark.sql("""
SELECT ETLLanguageCodes COLLATE UTF8_BINARY AS ETLLanguageCode
FROM (
SELECT 'en-us' COLLATE UTF8_LCASE AS ETLLanguageCodes
) temp
""")
print(df_result.collect())
If this works, it suggests that the collation is affecting serialization.
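If the DataFrame has more than one column, the same re-cast can be generated instead of written by hand. Below is a minimal sketch, not an official API: `binary_collation_exprs` is a hypothetical helper name, and it assumes that `df.dtypes` reports collated string columns with a type string starting with `string` (for example `string collate UTF8_LCASE`). It only handles top-level string columns, not strings nested inside arrays, maps, or structs.

```python
def binary_collation_exprs(dtypes):
    """Build SELECT expressions that re-cast string columns to UTF8_BINARY.

    `dtypes` is a list of (column_name, type_string) pairs, the same shape
    that DataFrame.dtypes returns in PySpark. Hypothetical helper, shown
    only to illustrate the workaround.
    """
    exprs = []
    for name, dtype in dtypes:
        # Quote the column name with backticks, escaping embedded backticks.
        quoted = "`" + name.replace("`", "``") + "`"
        # Assumption: collated strings show up as "string collate <NAME>".
        if dtype.lower().startswith("string"):
            exprs.append(f"{quoted} COLLATE UTF8_BINARY AS {quoted}")
        else:
            # Non-string columns are passed through unchanged.
            exprs.append(quoted)
    return exprs


# Intended usage against a Spark session (not executed here):
#   safe_df = df.selectExpr(*binary_collation_exprs(df.dtypes))
#   print(safe_df.collect())
```

The helper only builds strings, so you can print its output and check the generated expressions before passing them to `selectExpr`.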
2 weeks ago
That is what I'm doing now, but I can't imagine that this is the intended behavior. I think this is a bug that needs to be fixed. This is just a simple example; the same thing happens with DataFrames coming from Unity Catalog where the collation is set on the table.

