withColumnRenamed error on Unity Catalog 14.3 LTS

ksamborn
New Contributor II

Hi - 

 

We are migrating to Unity Catalog 14.3 LTS and have seen a change in behavior using withColumnRenamed.

We get a COLUMN_ALREADY_EXISTS error on the join key, even though the column being renamed is a different column. The joined DataFrame behaves fine before the renaming.

We join using a join expression, like this:
df_1 = spark.createDataFrame([(1, "a", 1)], ["key", "field1", "field2"])
df_2 = spark.createDataFrame([(1, "test", 1000)], ["key", "data1", "data2"])

# this fails with UC and works fine with non-UC

join_df = df_1.join(df_2, df_1.key == df_2.key)
#join_df = join_df.drop(df_2.key)

rename_df = join_df.withColumnRenamed("data1", "rename_data1")

join_df.display()
rename_df.display()

We receive this error:

[COLUMN_ALREADY_EXISTS] The column `key` already exists. Consider to choose another name or rename the existing column. SQLSTATE: 42711

A few other observations:

- If we join on a column-name string instead, the behavior is fine: join_df2 = df_1.join(df_2, "key")

- If we uncomment the drop line above, it's fine.

- Aliasing the two tables before the join doesn't help.

- It works fine in 13.3 LTS; the issue first appeared in 14.0.

 

Is there something we are doing wrong or should be doing differently?

 

Thanks for your help!

 

2 REPLIES

Palash01
Contributor

Hey @ksamborn 

I can think of 2 solutions:

  1. Rename the duplicate key column in df_2 before joining:

df_2_renamed = df_2.withColumnRenamed("key", "key_2")
join_df = df_1.join(df_2_renamed, df_1.key == df_2_renamed.key_2)
rename_df = join_df.withColumnRenamed("data1", "rename_data1")
 2. Use aliases for the two tables before the join:

df_1_alias = df_1.alias("t1")
df_2_alias = df_2.alias("t2")
join_df = df_1_alias.join(df_2_alias, df_1_alias.key == df_2_alias.key)
rename_df = join_df.withColumnRenamed("t2.data1", "rename_data1")
Let us know if this works; otherwise, follow-ups are appreciated.

 

Leave a like if this helps! Kudos,
Palash

ksamborn
New Contributor II

Hi Palash

Thanks for your reply.

In short, renaming the column before the join (rather than after it) works, as does selecting only the columns that aren't duplicate keys.

The aliasing does not work. In fact, I've taken the standard example from the PySpark documentation, and withColumnRenamed() fails when I add the last two lines here:

from pyspark.sql.functions import col, desc
df = spark.createDataFrame(
    [(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])
df_as1 = df.alias("df_as1")
df_as2 = df.alias("df_as2")
joined_df = df_as1.join(df_as2, col("df_as1.name") == col("df_as2.name"), 'inner')

joined_df.select(
    "df_as1.name", "df_as2.name", "df_as2.age").sort(desc("df_as1.name")).show()

renamed_df = joined_df.withColumnRenamed("df_as1.age", "age_renamed")
renamed_df.show()

 

I get exceptions for both of the last two lines when running in an Azure Databricks 14.3 LTS notebook.

For what it’s worth, I did ask an AI what is going on here and received a response that the behavior of `withColumnRenamed()` changes when Unity Catalog is enabled because of data lineage tracking:

ksamborn_0-1708957061285.png

 

However, I cannot find any official reference to this (and it isn't exactly what we are experiencing).

Does anyone know anything about this?   Thanks for your help.

Kevin
