02-23-2024 05:19 AM - edited 02-23-2024 05:27 AM
Hi -
We are migrating to Unity Catalog on Databricks Runtime 14.3 LTS and have seen a change in behavior with withColumnRenamed.
We get a COLUMN_ALREADY_EXISTS error on the join key, even though the column being renamed is a different column. The joined DataFrame behaves fine before the renaming.
We join using a join expression, like this:
df_1 = spark.createDataFrame([(1, "a", 1)], ["key", "field1", "field2"])
df_2 = spark.createDataFrame([(1, "test", 1000)], ["key", "data1", "data2"])
# this fails with UC and works fine with non-UC
join_df = df_1.join(df_2, df_1.key == df_2.key)
#join_df = join_df.drop(df_2.key)
rename_df = join_df.withColumnRenamed("data1", "rename_data1")
join_df.display()
rename_df.display()
We receive this error:
[COLUMN_ALREADY_EXISTS] The column `key` already exists. Consider to choose another name or rename the existing column. SQLSTATE: 42711
A few other observations:
- If we use the syntax that specifies a string for the column name, the behavior is fine: join_df2 = df_1.join(df_2, "key")
- If we uncomment the drop line above, it's fine
- Aliasing the two tables before the join doesn't help
- It works fine in 13.3 LTS; the issue appeared in 14.0 and later
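The error above is consistent with a duplicate-name check running on the whole joined schema before the rename is resolved. As a toy illustration only (this is plain Python, not Spark internals, and the check shown is an assumption about the observed behavior):

```python
# Toy model of a duplicate-column-name check applied before a rename.
# Column names are plain strings here; Spark's real resolver is more involved.

class ColumnAlreadyExists(Exception):
    pass

def rename_column(schema, old, new):
    """Rename `old` to `new` in a flat list of column names, refusing to
    operate on a schema that already contains a duplicate name."""
    seen = set()
    for name in schema:
        if name in seen:
            # Mirrors the COLUMN_ALREADY_EXISTS error reported above.
            raise ColumnAlreadyExists(
                f"[COLUMN_ALREADY_EXISTS] The column `{name}` already exists.")
        seen.add(name)
    return [new if c == old else c for c in schema]

# A join on df_1.key == df_2.key keeps both `key` columns in the result:
joined_schema = ["key", "field1", "field2", "key", "data1", "data2"]
try:
    rename_column(joined_schema, "data1", "rename_data1")
except ColumnAlreadyExists as exc:
    # The rename fails even though "data1" itself is unambiguous.
    print(exc)
```

This would explain why the error names `key` (the duplicated column) rather than `data1` (the column being renamed).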
Is there something we are doing wrong or should be doing differently?
Thanks for your help!
02-25-2024 12:29 AM - edited 02-25-2024 12:30 AM
Hey @ksamborn
One workaround I can think of: alias the two DataFrames before the join, then rename using the alias-qualified column name:
df_1_alias = df_1.alias("t1")
df_2_alias = df_2.alias("t2")
join_df = df_1_alias.join(df_2_alias, df_1_alias.key == df_2_alias.key)
rename_df = join_df.withColumnRenamed("t2.data1", "rename_data1")
Let us know if this works; otherwise, follow-ups are appreciated.
02-26-2024 06:17 AM
Hi Palash
Thanks for your reply.
In short, renaming the duplicate column before the join (and before the column rename) works, as does selecting only the columns that aren't duplicate keys.
The aliasing does not work. In fact, I've used the standard example from the PySpark documentation, and withColumnRenamed() doesn't work when I add the last two lines here:
from pyspark.sql.functions import col, desc
df = spark.createDataFrame(
[(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])
df_as1 = df.alias("df_as1")
df_as2 = df.alias("df_as2")
joined_df = df_as1.join(df_as2, col("df_as1.name") == col("df_as2.name"), 'inner')
joined_df.select(
"df_as1.name", "df_as2.name", "df_as2.age").sort(desc("df_as1.name")).show()
renamed_df = joined_df.withColumnRenamed("df_as1.age", "age_renamed")
renamed_df.show()
I get exceptions for both of the last two lines when running in an Azure Databricks 14.3 LTS notebook.
For what it's worth, I did ask an AI what is going on here and received a response claiming that the behavior of `withColumnRenamed()` changes when Unity Catalog is enabled because of data lineage tracking. However, I cannot find any official reference to this (and it isn't exactly what we are experiencing).
Does anyone know anything about this? Thanks for your help.
Kevin