Databricks Community

MDV · 2 weeks ago

I'm getting an error when I call first() or collect() on a DataFrame whose column uses a collation other than UTF8_BINARY.

Example that reproduces the issue:

This works:

df_result = spark.sql("""
SELECT 'en-us' AS ETLLanguageCode
""")
display(df_result)
print(df_result.collect())
print(df_result.first())
print(df_result.first().asDict())

When I run this:

df_result = spark.sql("""
SELECT 'en-us' COLLATE UTF8_LCASE AS ETLLanguageCode
""")
display(df_result)
print(df_result.collect())
print(df_result.first())
print(df_result.first().asDict())

I get an error because first() returns nothing, even though the DataFrame's count is 1.

What can I do to resolve this? All string columns in my tables use UTF8_LCASE.

Settings:

Workers: 1-1 (16-16 GB memory, 4-4 cores)
Driver: 1 (16 GB memory, 4 cores)
Runtime: 16.3.x-scala2.12
Unity Catalog
Photon
Standard_D4ds_v5

SP_6721 · 2 weeks ago

Hi @MDV

I suspect the issue comes from how non-default collations such as UTF8_LCASE are handled during serialization when first() or collect() returns rows to the driver. As a workaround, wrap the value in a subquery and re-collate it to UTF8_BINARY before accessing it:

df_result = spark.sql("""
SELECT ETLLanguageCodes COLLATE UTF8_BINARY AS ETLLanguageCode
FROM (
SELECT 'en-us' COLLATE UTF8_LCASE AS ETLLanguageCodes
) temp
""")
print(df_result.collect())

If this works, it would suggest that the collation is affecting serialization.
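For columns that come from a table rather than a literal, the same re-collation can be generated per column instead of written by hand. The following is only a minimal sketch of that idea: the helper name `binary_collation_select` and the table name `my_catalog.my_schema.languages` are hypothetical, and it assumes you already know which columns are strings. It only builds the SQL text, so it works around the symptom and does not address the underlying behavior.

```python
def binary_collation_select(source, string_columns, other_columns=()):
    """Build a SELECT that re-collates the given string columns to UTF8_BINARY.

    `source` is a table or view name; it is inserted as-is, so pass only
    trusted identifiers. Column names are backtick-quoted.
    """
    def quote(name):
        # Escape embedded backticks by doubling them.
        return "`" + name.replace("`", "``") + "`"

    recollated = [
        f"{quote(c)} COLLATE UTF8_BINARY AS {quote(c)}" for c in string_columns
    ]
    passthrough = [quote(c) for c in other_columns]
    return f"SELECT {', '.join(recollated + passthrough)} FROM {source}"


# Hypothetical table name, for illustration only.
query = binary_collation_select(
    "my_catalog.my_schema.languages", ["ETLLanguageCode"], ["Id"]
)
print(query)

# In a notebook with an active `spark` session, one would then run:
#   rows = spark.sql(query).collect()
```

Keeping the original column names as aliases means downstream code that reads `first().asDict()` does not need to change.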

MDV · 2 weeks ago

That is what I'm doing now, but I can't imagine this is the intended behavior. I think this is a bug that needs to be fixed. This is just a simple example; the same thing happens with DataFrames read from Unity Catalog tables where the collation is set on the table.


Problem with df.first() or collect() when collation different from UTF8_BINARY
