Comparing two dataframes

SiddarthaPaturu
New Contributor II

How can we compare two data frames using PySpark?

I need to validate my output against another dataset.

1 ACCEPTED SOLUTION

girivaratharaja
New Contributor III

>>> df1.subtract(df2)

As per the API docs, this returns a new DataFrame containing rows in this frame but not in the other frame.

This is equivalent to EXCEPT in SQL.

https://spark.apache.org/docs/1.3.0/api/python/pyspark.sql.html?highlight=dataframe#pyspark.sql.Data...
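
A minimal sketch of that check (the sample frames below are hypothetical), assuming both dataframes share the same schema: an empty subtract in both directions means the frames contain the same rows.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample frames with a shared schema
df1 = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
df2 = spark.createDataFrame([(1, "a"), (3, "c")], ["id", "val"])

only_in_df1 = df1.subtract(df2)  # rows in df1 but not in df2
only_in_df2 = df2.subtract(df1)  # rows in df2 but not in df1

if only_in_df1.count() == 0 and only_in_df2.count() == 0:
    print("DataFrames contain the same rows")
else:
    only_in_df1.show()
    only_in_df2.show()

Note that subtract, like SQL's EXCEPT, de-duplicates rows, so two frames that differ only in duplicate counts will still pass this check.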


8 REPLIES


SiddarthaPaturu
New Contributor II

It's giving only the rows which are not in the other data frame. Is there anything that validates all the column values in both dataframes?

girivaratharaja
New Contributor III

@Siddartha Paturu​ If that is the case, I would recommend doing a join between the two dataframes and then comparing all the columns, as sketched below.
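
A rough sketch of that join-and-compare approach, assuming both frames share a key column named id (a hypothetical name) and otherwise identical schemas; eqNullSafe requires Spark 2.3+.

from pyspark.sql import functions as F

# Join the two frames on the shared key column
joined = df1.alias("a").join(df2.alias("b"), on="id", how="inner")

# Build one mismatch flag per non-key column and OR them together;
# eqNullSafe treats NULL == NULL as a match.
mismatch = F.lit(False)
for c in df1.columns:
    if c == "id":
        continue
    mismatch = mismatch | ~F.col("a." + c).eqNullSafe(F.col("b." + c))

joined.filter(mismatch).show()  # rows where any column value disagrees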

SiddarthaPaturu
New Contributor II

How can we compare the columns?

jagannathsahoo
New Contributor II

I recently got stuck in the same situation. Can somebody help me with how to compare the columns in this scenario? @Siddartha Paturu​ please help me out if you have already found the solution. Thanks in advance.

ShashishekharDe
New Contributor II

I am stuck with the same issue. Any new updates on this? Is there any solution to this problem?

amandaphy
New Contributor II

Try using the all.equal function (from R). It does not sort the dataframes, but it checks each cell in one data frame against the same cell in the other. You can also use the identical() function.

I would like to share a link which may help solve your problem: https://goo.gl/pgLaEd
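
Note that all.equal and identical are base R functions, so they won't work on PySpark dataframes directly. Spark 3.5+ ships a built-in test helper that plays a similar cell-by-cell role; a minimal sketch:

from pyspark.testing import assertDataFrameEqual

# Raises an assertion error with a row-level diff if the frames differ;
# requires Spark 3.5 or later.
assertDataFrameEqual(df1, df2)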

sbharti
New Contributor II

I think the best bet in such a case is to take an inner join (equivalent to an intersection) by putting a condition on those columns which need to have the same value in both dataframes. For example,

let df1 and df2 be two dataframes. df1 has columns (A, B, C) and df2 has columns (D, C, B); then you can create a new dataframe which is the intersection of df1 and df2, conditioned on columns B and C.

df3 = df1.join(df2, [df1.B == df2.B, df1.C == df2.C], how='inner')

df3 will contain only those rows from df1 and df2 where the above condition is satisfied.
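
As a complementary sketch, flipping the same condition to a left_anti join surfaces the df1 rows that found no (B, C) match in df2:

# df1 rows with no matching B and C values anywhere in df2
unmatched = df1.join(df2, [df1.B == df2.B, df1.C == df2.C], how='left_anti')
unmatched.show()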
