Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Comparing two dataframes

SiddarthaPaturu
New Contributor II

How can we compare two DataFrames using PySpark?

I need to validate my output against another dataset.

1 ACCEPTED SOLUTION

Accepted Solutions

girivaratharaja
New Contributor III

>>> df1.subtract(df2)

As per API Docs, it returns a new DataFrame containing rows in this frame but not in another frame.

This is equivalent to EXCEPT in SQL.

https://spark.apache.org/docs/1.3.0/api/python/pyspark.sql.html?highlight=dataframe#pyspark.sql.Data...


8 REPLIES 8

SiddarthaPaturu
New Contributor II

It's giving only the rows which are not in the other DataFrame. Is there anything that validates all the column values in both DataFrames?

girivaratharaja
New Contributor III

@Siddartha Paturu If that is the case, I would recommend joining the two DataFrames and then comparing all the columns.

SiddarthaPaturu
New Contributor II

How can we compare the columns?

jagannathsahoo
New Contributor II

I have recently been stuck with this situation as well. Can somebody help me with how to compare columns in this scenario? @Siddartha Paturu, please help me out if you have already found the solution. Thanks in advance.

ShashishekharDe
New Contributor II

I am stuck with the same issue. Are there any new updates on this?

Is there any solution to this problem?

amandaphy
New Contributor II

Try using the all.equal function (in R). It does not sort the data frames, but it checks each cell in one data frame against the same cell in the other. You can also use the identical() function.

I would like to share a link which may help to solve your problem: https://goo.gl/pgLaEd

sbharti
New Contributor II

I think the best bet in such a case is to take an inner join (equivalent to an intersection) with a condition on those columns which must have the same value in both DataFrames. For example,

let df1 and df2 be two DataFrames. If df1 has columns (A, B, C) and df2 has columns (D, C, B), then you can create a new DataFrame which is the intersection of df1 and df2 conditioned on columns B and C.

df3 = df1.join(df2, [df1.B == df2.B, df1.C == df2.C], how='inner')

df3 will contain only those rows from df1 and df2 where the above condition is satisfied.
