03-31-2016 01:53 PM
How can we compare two DataFrames using PySpark?
I need to validate my output against another dataset.
Accepted Solutions
03-31-2016 03:22 PM
>>> df1.subtract(df2)
As per the API docs, it returns a new DataFrame containing the rows in this DataFrame that are not in the other one.
This is equivalent to EXCEPT in SQL.
04-04-2016 06:38 AM
It gives only the rows that are not in the other DataFrame. Is there anything that validates all the column values in both DataFrames?
04-04-2016 10:20 AM
@Siddartha Paturu If that is the case, I would recommend joining the two DataFrames and then comparing them across all columns.
04-05-2016 08:36 AM
How can we compare the columns?
07-20-2016 08:21 AM
I am recently stuck with this situation too. Can somebody help me with how to compare columns in this scenario? @Siddartha Paturu, please help me out if you have already found a solution. Thanks in advance.
09-20-2016 03:29 PM
I am stuck with the same issue. Any new updates on this? Is there any solution to this problem?
09-24-2016 01:27 AM
Try using the all.equal function. It does not sort the data frames, but it checks each cell in one data frame against the same cell in the other. You can also use the identical() function.
I would like to share a link which may help solve your problem: https://goo.gl/pgLaEd
06-28-2018 06:53 AM
I think the best bet in such a case is to take an inner join (equivalent to an intersection) with a condition on those columns which necessarily need to have the same value in both DataFrames. For example,
let df1 and df2 be two DataFrames, where df1 has columns (A, B, C) and df2 has columns (D, C, B). You can then create a new DataFrame that is the intersection of df1 and df2 conditioned on columns B and C.
df3 = df1.join(df2, [df1.B == df2.B, df1.C == df2.C], how='inner')
df3 will contain only those rows from df1 and df2 where the above condition is satisfied.

