<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic withColumnRenamed error on Unity Catalog 14.3 LTS in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/withcolumnrenamed-error-on-unity-catalog-14-3-lts/m-p/61711#M31833</link>
    <description>&lt;DIV&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;P&gt;Hi -&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;We are migrating to Unity Catalog 14.3 LTS and have seen a change in behavior using withColumnRenamed.&lt;/P&gt;&lt;P&gt;There is an error COLUMN_ALREADY_EXISTS on the join key, even though the column being renamed is a different column.&amp;nbsp; &amp;nbsp;The joined DataFrame does behave fine before the renaming.&amp;nbsp; &amp;nbsp;&lt;/P&gt;&lt;P&gt;We use the join syntax using a join expression like this:&lt;/P&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;df_1 = spark.createDataFrame([(1, "a", 1)], ["key", "field1", "field2"])
df_2 = spark.createDataFrame([(1, "test", 1000)], ["key", "data1", "data2"])

# this fails with UC and works find with non-UC

join_df = df_1.join(df_2, df_1.key == df_2.key)
#join_df = join_df.drop(df_2.key)

rename_df = join_df.withColumnRenamed("data1", "rename_data1")

join_df.display()
rename_df.display()&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;We receive this error:&lt;/P&gt;&lt;P class="lia-indent-padding-left-30px"&gt;&lt;SPAN&gt;[COLUMN_ALREADY_EXISTS] The column `key` already exists. Consider to choose another name or rename the existing column. SQLSTATE: 42711&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;A few other observations:&lt;/P&gt;&lt;P&gt;- If we use syntax that specifies a string for the column name the behavior is fine,&amp;nbsp;&lt;SPAN&gt;join_df2 &lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt; df_1.&lt;/SPAN&gt;&lt;SPAN&gt;join&lt;/SPAN&gt;&lt;SPAN&gt;(df_2, &lt;/SPAN&gt;&lt;SPAN&gt;"key"&lt;/SPAN&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;- if we uncomment the drop line above it's fine&lt;/P&gt;&lt;P&gt;- aliasing the two tables before the join doesn't work&lt;/P&gt;&lt;P&gt;- it works&amp;nbsp;works fine in 13.3&amp;nbsp;LTS and appeared in 14.0 and later&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Is there something we are doing wrong or should be doing differently?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks for your help!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Fri, 23 Feb 2024 13:27:08 GMT</pubDate>
    <dc:creator>ksamborn</dc:creator>
    <dc:date>2024-02-23T13:27:08Z</dc:date>
    <item>
      <title>withColumnRenamed error on Unity Catalog 14.3 LTS</title>
      <link>https://community.databricks.com/t5/data-engineering/withcolumnrenamed-error-on-unity-catalog-14-3-lts/m-p/61711#M31833</link>
      <description>&lt;DIV&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;P&gt;Hi -&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;We are migrating to Unity Catalog 14.3 LTS and have seen a change in behavior using withColumnRenamed.&lt;/P&gt;&lt;P&gt;There is an error COLUMN_ALREADY_EXISTS on the join key, even though the column being renamed is a different column.&amp;nbsp; &amp;nbsp;The joined DataFrame does behave fine before the renaming.&amp;nbsp; &amp;nbsp;&lt;/P&gt;&lt;P&gt;We use the join syntax using a join expression like this:&lt;/P&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;df_1 = spark.createDataFrame([(1, "a", 1)], ["key", "field1", "field2"])
df_2 = spark.createDataFrame([(1, "test", 1000)], ["key", "data1", "data2"])

# this fails with UC and works find with non-UC

join_df = df_1.join(df_2, df_1.key == df_2.key)
#join_df = join_df.drop(df_2.key)

rename_df = join_df.withColumnRenamed("data1", "rename_data1")

join_df.display()
rename_df.display()&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;We receive this error:&lt;/P&gt;&lt;P class="lia-indent-padding-left-30px"&gt;&lt;SPAN&gt;[COLUMN_ALREADY_EXISTS] The column `key` already exists. Consider to choose another name or rename the existing column. SQLSTATE: 42711&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;A few other observations:&lt;/P&gt;&lt;P&gt;- If we use syntax that specifies a string for the column name the behavior is fine,&amp;nbsp;&lt;SPAN&gt;join_df2 &lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt; df_1.&lt;/SPAN&gt;&lt;SPAN&gt;join&lt;/SPAN&gt;&lt;SPAN&gt;(df_2, &lt;/SPAN&gt;&lt;SPAN&gt;"key"&lt;/SPAN&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;- if we uncomment the drop line above it's fine&lt;/P&gt;&lt;P&gt;- aliasing the two tables before the join doesn't work&lt;/P&gt;&lt;P&gt;- it works&amp;nbsp;works fine in 13.3&amp;nbsp;LTS and appeared in 14.0 and later&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Is there something we are doing wrong or should be doing differently?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks for your help!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 23 Feb 2024 13:27:08 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/withcolumnrenamed-error-on-unity-catalog-14-3-lts/m-p/61711#M31833</guid>
      <dc:creator>ksamborn</dc:creator>
      <dc:date>2024-02-23T13:27:08Z</dc:date>
    </item>
    <item>
      <title>Re: withColumnRenamed error on Unity Catalog 14.3 LTS</title>
      <link>https://community.databricks.com/t5/data-engineering/withcolumnrenamed-error-on-unity-catalog-14-3-lts/m-p/61932#M31854</link>
      <description>&lt;P&gt;Hey&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/100689"&gt;@ksamborn&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I can think of 2 solutions:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;SPAN&gt;Rename the column in &lt;/SPAN&gt;df_2&lt;SPAN&gt; before joining:&lt;/SPAN&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;df_1_alias = df_1.alias("t1")
df_2_alias = df_2.alias("t2")
join_df = df_1_alias.join(df_2_alias, df_1_alias.key == df_2_alias.key)
rename_df = join_df.withColumnRenamed("t2.data1", "rename_data1")&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;2.&amp;nbsp;&lt;SPAN&gt;Use aliases for the join tables before the join:&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;df_1_alias = df_1.alias("t1")
df_2_alias = df_2.alias("t2")
join_df = df_1_alias.join(df_2_alias, df_1_alias.key == df_2_alias.key)
rename_df = join_df.withColumnRenamed("t2.data1", "rename_data1")&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Let us know if this works otherwise followups are appreciated.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sun, 25 Feb 2024 08:30:15 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/withcolumnrenamed-error-on-unity-catalog-14-3-lts/m-p/61932#M31854</guid>
      <dc:creator>Palash01</dc:creator>
      <dc:date>2024-02-25T08:30:15Z</dc:date>
    </item>
    <item>
      <title>Re: withColumnRenamed error on Unity Catalog 14.3 LTS</title>
      <link>https://community.databricks.com/t5/data-engineering/withcolumnrenamed-error-on-unity-catalog-14-3-lts/m-p/61985#M31865</link>
      <description>&lt;P&gt;Hi Palash&lt;/P&gt;&lt;P&gt;Thanks for your reply.&lt;/P&gt;&lt;P&gt;In short, renaming the column before the join and before the column rename works (as does selecting only the columns that aren't duplicate keys).&lt;/P&gt;&lt;P&gt;The aliasing does not work.&amp;nbsp;&amp;nbsp; In fact, I've used the standard example from the Pyspark documentation and withColumnRenamed() doesn't work when I add the last two lines here:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;from pyspark.sql.functions import col, desc
df = spark.createDataFrame(
    [(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])
df_as1 = df.alias("df_as1")
df_as2 = df.alias("df_as2")
joined_df = df_as1.join(df_as2, col("df_as1.name") == col("df_as2.name"), 'inner')

joined_df.select(
    "df_as1.name", "df_as2.name", "df_as2.age").sort(desc("df_as1.name")).show()

renamed_df = joined_df.withColumnRenamed("df_as1.age", "age_renamed")
renamed_df.show()&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I get exceptions for both of the last two lines when running in a Azure Databricks 14.3 LTS notebook.&lt;/P&gt;&lt;P&gt;For what it’s worth, I did ask an AI what is going on here and received a response that the behavior of `withColumnRenamed()` changes when Unity Catalog is enabled because of data lineage tracking:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="ksamborn_0-1708957061285.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/6437iF15236B9ED295EAE/image-size/medium/is-moderation-mode/true?v=v2&amp;amp;px=400" role="button" title="ksamborn_0-1708957061285.png" alt="ksamborn_0-1708957061285.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;However I cannot find any official references to this.&amp;nbsp; (And this isn't exactly what we are experiencing.)&lt;/P&gt;&lt;P&gt;Does anyone know anything about this?&amp;nbsp;&amp;nbsp; Thanks for your help.&lt;/P&gt;&lt;P&gt;Kevin&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 26 Feb 2024 14:17:58 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/withcolumnrenamed-error-on-unity-catalog-14-3-lts/m-p/61985#M31865</guid>
      <dc:creator>ksamborn</dc:creator>
      <dc:date>2024-02-26T14:17:58Z</dc:date>
    </item>
  </channel>
</rss>

