Why does a join with on (df1.id == df2.id) result in duplicate columns, but on="id" does not?
I ran into some surprising behavior while joining two DataFrames. Here's the scenario:
df1 = spark.createDataFrame([(1, "Alice"), (2, "Bob"), (3, "Charlie")], ["id", "name"])
df2 = spark.createDataFrame([(2, "Bob"), (3, "Charlie"), (4, "David")], ["id", "city"])
When I join the DataFrames like this:
joined_df = df1.join(df2, on=(df1.id == df2.id), how="inner")
the id column appears twice in the result.
However, when I modify the join to:
joined_df = df1.join(df2, on="id", how="inner")
only one id column is kept, which is the behavior I was expecting.
Can anyone explain why this happens? Is it related to how Spark resolves column names in a join condition versus a join key? Any insight would be appreciated!