
Problem with df.first() or collect() when collation different from UTF8_BINARY

MDV — Tue, 08 Apr 2025 09:03:41 GMT

I'm getting an error when I call first() or collect() on a DataFrame whose column uses a collation other than UTF8_BINARY.

Example that reproduces the issue :

This works :

df_result = spark.sql("""
SELECT 'en-us' AS ETLLanguageCode
""")
display(df_result)
print(df_result.collect())
print(df_result.first())
print(df_result.first().asDict())

When I run this :

df_result = spark.sql("""
SELECT 'en-us' COLLATE UTF8_LCASE AS ETLLanguageCode
""")
display(df_result)
print(df_result.collect())
print(df_result.first())
print(df_result.first().asDict())

I get an error because first() comes back empty, even though the DataFrame's count is 1.

What can I do to resolve this? All string columns in my tables use UTF8_LCASE.

Settings:

Worker: 1-1, 16-16 GB Memory, 4-4 Cores
Driver: 1, 16 GB Memory, 4 Cores
Runtime: 16.3.x-scala2.12
Unity Catalog
Photon
Standard_D4ds_v5

Re: Problem with df.first() or collect() when collation different from UTF8_BINARY

SP_6721 — Tue, 08 Apr 2025 10:19:35 GMT

Hi @MDV

The issue likely comes from how non-default collations such as UTF8_LCASE are handled during serialization when first() or collect() is called. As a workaround, wrap the value in a subquery and cast the collation back to UTF8_BINARY before accessing it:

df_result = spark.sql("""
SELECT ETLLanguageCodes COLLATE UTF8_BINARY AS ETLLanguageCode
FROM (
SELECT 'en-us' COLLATE UTF8_LCASE AS ETLLanguageCodes
) temp
""")
print(df_result.collect())

If this works, it likely confirms the collation is affecting serialization.
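
For DataFrames read from tables, the same idea could be applied to every string column before collecting. Below is a minimal sketch, not a confirmed fix: the helper name binary_collation_exprs is hypothetical, and it assumes you pass in the names of the collated string columns yourself. It only builds the SQL expression strings, so it runs without a cluster.

```python
from typing import Iterable, List


def binary_collation_exprs(string_columns: Iterable[str]) -> List[str]:
    """Build one '<col> COLLATE UTF8_BINARY AS <col>' expression per column name.

    Hypothetical helper: the caller decides which columns are collated strings.
    """
    exprs = []
    for name in string_columns:
        # Backtick-quote the identifier, doubling any embedded backticks.
        quoted = "`" + name.replace("`", "``") + "`"
        exprs.append(f"{quoted} COLLATE UTF8_BINARY AS {quoted}")
    return exprs


# Assumed usage on a cluster (left as a comment because it needs a Spark session):
#   string_cols = ["ETLLanguageCode"]
#   rows = df_result.selectExpr(*binary_collation_exprs(string_cols)).collect()
print(binary_collation_exprs(["ETLLanguageCode"]))
```

Note that selectExpr returns only the expressions it is given, so any non-string columns you still need would have to be added to the list by name.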

Re: Problem with df.first() or collect() when collation different from UTF8_BINARY

MDV — Tue, 08 Apr 2025 11:58:17 GMT

That is what I'm doing now, but I can't imagine this is the intended behavior. I think this is a bug that needs to be fixed. This is just a simple example; the same thing happens with DataFrames read from Unity Catalog tables where the collation is set on the table.