Why does a join on (df1.id == df2.id) result in duplicate columns while on="id" does not?

Tanay — Mon, 23 Dec 2024 07:50:21 GMT

Why does a join with on (df1.id == df2.id) result in duplicate columns, but on="id" does not?

I encountered an interesting behavior while performing a join on two DataFrames. Here's the scenario:

df1 = spark.createDataFrame([(1, "Alice"), (2, "Bob"), (3, "Charlie")], ["id", "name"])
df2 = spark.createDataFrame([(2, "Bob"), (3, "Charlie"), (4, "David")], ["id", "city"])

When I join the DataFrames like this:

joined_df = df1.join(df2, on = (df1.id == df2.id), how = "inner")

It results in the id column appearing twice in the result.

However, when I modify the join to:

joined_df = df1.join(df2, on="id", how="inner")

It only keeps one id column, which is the behavior I was expecting.

Can anyone explain why this happens? Does it have to do with how Spark handles column names or the join condition? Any insight would be appreciated!

Re: Why does a join on (df1.id == df2.id) result in duplicate columns while on="id" does n

szymon_dybczak — Mon, 23 Dec 2024 09:12:48 GMT

Hi @Tanay ,

Your intuition is correct here. In Apache Spark, the difference in behavior between on (df1.id == df2.id) and on="id" in a join stems from how Spark resolves and handles column naming during the join operation.

When you use the first syntax (df1.id == df2.id), you are explicitly specifying a join condition using a column expression. Spark does not automatically deduplicate columns when using this approach.
Instead, the resulting DataFrame will contain all columns from both DataFrames, even if they have the same name. For example, if both DataFrames have a column named EmployeeNumber, the result will contain two columns:

EmployeeNumber from df1
EmployeeNumber from df2

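Spark does not rename or qualify these columns in the output: both are simply called id, so joined_df.columns lists id twice. Referring to the column by name alone, for example joined_df.select("id"), then fails with an ambiguous-reference error. To tell the two apart, refer to them through the original DataFrames (df1.id, df2.id) or through aliases assigned before the join.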

When you use the on="id" syntax (or a list of names such as on=["id"]), you are specifying an equi-join on a column that has the same name in both DataFrames. This corresponds to JOIN ... USING (id) in SQL, and Spark automatically deduplicates the join column in the output.
The resulting DataFrame will contain only one id column. For an inner join the two values are equal, so it does not matter which side is kept; for a left outer join the value comes from the left DataFrame, for a right outer join from the right one, and for a full outer join Spark keeps whichever side is non-null.

So why is there a difference in behavior? It lies in how explicit column expressions (df1.id == df2.id) versus string column names (on="id") are interpreted:

Explicit Expressions (df1.id == df2.id):
- Spark treats df1.id and df2.id as separate entities.
- No deduplication occurs; both columns are retained.
String Columns (on="id"):
- Spark identifies that id is the same column in both DataFrames.
- Spark deduplicates the column by keeping only one instance of id.
