Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Problem with df.first() or collect() when collation different from UTF8_BINARY

MDV
New Contributor III

I'm getting an error when I call first() or collect() on a DataFrame whose string column uses a collation other than UTF8_BINARY.

Example that reproduces the issue:

This works:

df_result = spark.sql(f"""
                        SELECT 'en-us' AS ETLLanguageCode
""")
display(df_result)
print(df_result.collect())
print(df_result.first())
print(df_result.first().asDict())
 
But this fails:
 
df_result = spark.sql(f"""
                        SELECT 'en-us' COLLATE UTF8_LCASE AS ETLLanguageCode
""")
display(df_result)
print(df_result.collect())
print(df_result.first())
print(df_result.first().asDict())
 
I get an error because first() returns nothing, even though the count of the DataFrame is 1.
 
What can I do to resolve this? All string columns in my tables use UTF8_LCASE.
 
Settings:
Workers: 1-1 (16-16 GB Memory, 4-4 Cores)
Driver: 1 (16 GB Memory, 4 Cores)
Runtime: 16.3.x-scala2.12
Unity Catalog
Photon
Standard_D4ds_v5
2 REPLIES

SP_6721
New Contributor II

Hi @MDV 

I suspect the issue comes from how non-default collations such as UTF8_LCASE are handled during serialization when first() or collect() is called. As a workaround, wrap the query in a subquery and cast the column back to UTF8_BINARY before accessing it:

df_result = spark.sql("""
    SELECT ETLLanguageCodes COLLATE UTF8_BINARY AS ETLLanguageCode
    FROM (
        SELECT 'en-us' COLLATE UTF8_LCASE AS ETLLanguageCodes
    ) temp
""")
print(df_result.collect())

If this works, it likely confirms the collation is affecting serialization.
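
For tables with several collated string columns, the same cast could be generated instead of typed by hand. The following is only a minimal sketch: `binary_collation_exprs` is a hypothetical helper name (not a Spark or Databricks API), and it assumes the caller passes only string column names, since applying COLLATE to a non-string column would fail.

```python
from typing import Iterable, List


def binary_collation_exprs(string_columns: Iterable[str]) -> List[str]:
    """Build SQL select expressions that cast each column to UTF8_BINARY.

    Hypothetical helper for illustration; pass only string columns.
    """
    exprs = []
    for name in string_columns:
        # Quote the identifier with backticks; a literal backtick is doubled.
        quoted = "`" + name.replace("`", "``") + "`"
        exprs.append(f"{quoted} COLLATE UTF8_BINARY AS {quoted}")
    return exprs


# Assumed usage in a notebook where `df_result` already exists:
#   df_binary = df_result.selectExpr(*binary_collation_exprs(["ETLLanguageCode"]))
#   print(df_binary.collect())
print(binary_collation_exprs(["ETLLanguageCode"]))
```

This only automates the workaround above; it does not address the underlying behavior of first() and collect() with non-default collations.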

MDV
New Contributor III

That is what I'm doing now, but I can't imagine that this is the intended behavior. I think this is a bug that needs to be fixed. This is just a simple example; the same thing happens with DataFrames read from Unity Catalog tables where the collation is set on the table.
