02-23-2024 05:19 AM - edited 02-23-2024 05:27 AM
Hi -
We are migrating to Unity Catalog on Databricks Runtime 14.3 LTS and have seen a change in behavior with withColumnRenamed.
We get a COLUMN_ALREADY_EXISTS error on the join key, even though the column being renamed is a different column. The joined DataFrame behaves fine before the renaming.
We join using a join expression, like this:
df_1 = spark.createDataFrame([(1, "a", 1)], ["key", "field1", "field2"])
df_2 = spark.createDataFrame([(1, "test", 1000)], ["key", "data1", "data2"])
# this fails with UC and works fine with non-UC
join_df = df_1.join(df_2, df_1.key == df_2.key)
#join_df = join_df.drop(df_2.key)
rename_df = join_df.withColumnRenamed("data1", "rename_data1")
join_df.display()
rename_df.display()
We receive this error:
[COLUMN_ALREADY_EXISTS] The column `key` already exists. Consider to choose another name or rename the existing column. SQLSTATE: 42711
A few other observations:
- If we use the syntax that specifies a string for the column name, the behavior is fine: join_df2 = df_1.join(df_2, "key")
- If we uncomment the drop line above, it's fine
- Aliasing the two tables before the join doesn't help
- It works fine in 13.3 LTS; the issue appeared in 14.0 and later
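The error above is consistent with a duplicate-name check running on the whole joined schema before the rename is resolved. As a toy illustration only (this is plain Python, not Spark internals, and the check shown is an assumption about the observed behavior):

```python
# Toy model of a duplicate-column-name check applied before a rename.
# Column names are plain strings here; Spark's real resolver is more involved.

class ColumnAlreadyExists(Exception):
    pass

def rename_column(schema, old, new):
    """Rename `old` to `new` in a flat list of column names, refusing to
    operate on a schema that already contains a duplicate name."""
    seen = set()
    for name in schema:
        if name in seen:
            # Mirrors the COLUMN_ALREADY_EXISTS error reported above.
            raise ColumnAlreadyExists(
                f"[COLUMN_ALREADY_EXISTS] The column `{name}` already exists.")
        seen.add(name)
    return [new if c == old else c for c in schema]

# A join on df_1.key == df_2.key keeps both `key` columns in the result:
joined_schema = ["key", "field1", "field2", "key", "data1", "data2"]
try:
    rename_column(joined_schema, "data1", "rename_data1")
except ColumnAlreadyExists as exc:
    # The rename fails even though "data1" itself is unambiguous.
    print(exc)
```

This would explain why the error names `key` (the duplicated column) rather than `data1` (the column being renamed).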
Is there something we are doing wrong or should be doing differently?
Thanks for your help!
02-25-2024 12:29 AM - edited 02-25-2024 12:30 AM
Hey @ksamborn
One workaround I can think of: alias the two DataFrames before the join, then rename using the alias-qualified column name:
df_1_alias = df_1.alias("t1")
df_2_alias = df_2.alias("t2")
join_df = df_1_alias.join(df_2_alias, df_1_alias.key == df_2_alias.key)
rename_df = join_df.withColumnRenamed("t2.data1", "rename_data1")
Let us know if this works; otherwise, follow-ups are appreciated.
02-26-2024 06:17 AM
Hi Palash
Thanks for your reply.
In short, renaming the duplicate column before the join (and before the column rename) works, as does selecting only the columns that aren't duplicate keys.
The aliasing does not work. In fact, I've used the standard example from the PySpark documentation, and withColumnRenamed() doesn't work when I add the last two lines here:
from pyspark.sql.functions import col, desc
df = spark.createDataFrame(
[(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])
df_as1 = df.alias("df_as1")
df_as2 = df.alias("df_as2")
joined_df = df_as1.join(df_as2, col("df_as1.name") == col("df_as2.name"), 'inner')
joined_df.select(
"df_as1.name", "df_as2.name", "df_as2.age").sort(desc("df_as1.name")).show()
renamed_df = joined_df.withColumnRenamed("df_as1.age", "age_renamed")
renamed_df.show()
I get exceptions for both of the last two lines when running in an Azure Databricks 14.3 LTS notebook.
For what it's worth, I did ask an AI what is going on here and received a response claiming that the behavior of `withColumnRenamed()` changes when Unity Catalog is enabled because of data lineage tracking. However, I cannot find any official reference to this (and it isn't exactly what we are experiencing).
Does anyone know anything about this? Thanks for your help.
Kevin