cancel
Showing results for 
Search instead for 
Did you mean: 
Community Platform Discussions
Connect with fellow community members to discuss general topics related to the Databricks platform, industry trends, and best practices. Share experiences, ask questions, and foster collaboration within the community.
cancel
Showing results for 
Search instead for 
Did you mean: 

Why does a join on (df1.id == df2.id) result in duplicate columns while on="id" does not?

Tanay
New Contributor II

Why does a join with on (df1.id == df2.id) result in duplicate columns, but on="id" does not?

I encountered an interesting behavior while performing a join on two Data frames. Here's the scenario:

 

df1 = spark.createDataFrame([(1, "Alice"), (2, "Bob"), (3, "Charlie")], ["id", "name"])
df2 = spark.createDataFrame([(2, "Bob"), (3, "Charlie"), (4, "David")], ["id", "city"])

 

When I join the Data frames like this:

 

joined_df = df1.join(df2, on = (df1.id == df2.id), how = "inner")

 

It results in the id column appearing twice in the result.

However, when I modify the join to:

 

joined_df = df1.join(df2, on="id", how="inner")

 

It only keeps one id column, which is the behavior I was expecting.

Can anyone explain why this happens? Does it have to do with how Spark handles column names or the join condition? Any insight would be appreciated!

1 ACCEPTED SOLUTION

Accepted Solutions

szymon_dybczak
Contributor III

Hi @Tanay ,

 
Your intuition is correct here. In Apache Spark, the difference in behavior between on (df1.id == df2.id) and on="id" in a join stems from how Spark resolves and handles column naming during the join operation.

When you use the first syntax (df1.id == df2.id), you are explicitly specifying a join condition using a column expression. Spark does not automatically deduplicate columns when using this approach.
Instead, the resulting DataFrame will contain all columns from both DataFrames, even if they have the same name. For example, if both DataFrames have a column named EmployeeNumber, the result will contain two columns:

  • EmployeeNumber from df1
  • EmployeeNumber from df2

Spark handles the name conflict by qualifying the column names with their respective DataFrame aliases (if provided) or default names. For example, you might see df1.id and df2.id.

When you use the on="id"  syntax, you are specifying that the join should be based on a column with the same name in both DataFrames. Spark recognizes this as a natural join key and automatically deduplicates the column in the output.
The resulting DataFrame will contain only one id column, which corresponds to the join key, and it will take the value from one of the DataFrames (typically the left DataFrame unless specified otherwise).

And why we have this difference in behavioir? The difference lies in how explicit column expressions (df1.id == df2.id) versus string column names (on="id") are interpreted:

  • Explicit Expressions (df1.id == df2.id):

    • Spark treats df1.id and df2.id as separate entities.
    • No deduplication occurs; both columns are retained.
  • String Columns (on="id"):

    • Spark identifies that id is the same column in both DataFrames.
    • Spark deduplicates the column by keeping only one instance of id.

View solution in original post

1 REPLY 1

szymon_dybczak
Contributor III

Hi @Tanay ,

 
Your intuition is correct here. In Apache Spark, the difference in behavior between on (df1.id == df2.id) and on="id" in a join stems from how Spark resolves and handles column naming during the join operation.

When you use the first syntax (df1.id == df2.id), you are explicitly specifying a join condition using a column expression. Spark does not automatically deduplicate columns when using this approach.
Instead, the resulting DataFrame will contain all columns from both DataFrames, even if they have the same name. For example, if both DataFrames have a column named EmployeeNumber, the result will contain two columns:

  • EmployeeNumber from df1
  • EmployeeNumber from df2

Spark handles the name conflict by qualifying the column names with their respective DataFrame aliases (if provided) or default names. For example, you might see df1.id and df2.id.

When you use the on="id"  syntax, you are specifying that the join should be based on a column with the same name in both DataFrames. Spark recognizes this as a natural join key and automatically deduplicates the column in the output.
The resulting DataFrame will contain only one id column, which corresponds to the join key, and it will take the value from one of the DataFrames (typically the left DataFrame unless specified otherwise).

And why we have this difference in behavioir? The difference lies in how explicit column expressions (df1.id == df2.id) versus string column names (on="id") are interpreted:

  • Explicit Expressions (df1.id == df2.id):

    • Spark treats df1.id and df2.id as separate entities.
    • No deduplication occurs; both columns are retained.
  • String Columns (on="id"):

    • Spark identifies that id is the same column in both DataFrames.
    • Spark deduplicates the column by keeping only one instance of id.

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group