<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Comparing two dataframes in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/comparing-two-dataframes/m-p/29798#M21501</link>
    <description>&lt;P&gt;I am stuck on the same issue. Are there any updates on this?&lt;/P&gt;
&lt;P&gt;Is there a solution to this problem?&lt;/P&gt;</description>
    <pubDate>Tue, 20 Sep 2016 22:29:42 GMT</pubDate>
    <dc:creator>ShashishekharDe</dc:creator>
    <dc:date>2016-09-20T22:29:42Z</dc:date>
    <item>
      <title>Comparing two dataframes</title>
      <link>https://community.databricks.com/t5/data-engineering/comparing-two-dataframes/m-p/29792#M21495</link>
      <description>&lt;P&gt;How can we compare two DataFrames using PySpark?&lt;/P&gt;
&lt;P&gt;I need to validate my output against another dataset.&lt;/P&gt;</description>
      <pubDate>Thu, 31 Mar 2016 20:53:51 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/comparing-two-dataframes/m-p/29792#M21495</guid>
      <dc:creator>SiddarthaPaturu</dc:creator>
      <dc:date>2016-03-31T20:53:51Z</dc:date>
    </item>
    <item>
      <title>Re: Comparing two dataframes</title>
      <link>https://community.databricks.com/t5/data-engineering/comparing-two-dataframes/m-p/29793#M21496</link>
      <description>&lt;P&gt;&amp;gt;&amp;gt;&amp;gt; df1.subtract(df2)&lt;/P&gt;
&lt;P&gt;As per the API docs, it returns a new DataFrame containing the rows that are in this frame but not in the other frame.&lt;/P&gt;
&lt;P&gt;This is equivalent to EXCEPT in SQL.&lt;/P&gt;
&lt;P&gt;&lt;A href="https://spark.apache.org/docs/1.3.0/api/python/pyspark.sql.html?highlight=dataframe#pyspark.sql.DataFrame.subtract" target="test_blank"&gt;https://spark.apache.org/docs/1.3.0/api/python/pyspark.sql.html?highlight=dataframe#pyspark.sql.DataFrame.subtract&lt;/A&gt;&lt;/P&gt; 
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 31 Mar 2016 22:22:46 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/comparing-two-dataframes/m-p/29793#M21496</guid>
      <dc:creator>girivaratharaja</dc:creator>
      <dc:date>2016-03-31T22:22:46Z</dc:date>
    </item>
    <item>
      <title>Re: Comparing two dataframes</title>
      <link>https://community.databricks.com/t5/data-engineering/comparing-two-dataframes/m-p/29794#M21497</link>
      <description>&lt;P&gt;It only gives the rows that are not in the other DataFrame. Is there anything that validates all the column values in both DataFrames?&lt;/P&gt;</description>
      <pubDate>Mon, 04 Apr 2016 13:38:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/comparing-two-dataframes/m-p/29794#M21497</guid>
      <dc:creator>SiddarthaPaturu</dc:creator>
      <dc:date>2016-04-04T13:38:53Z</dc:date>
    </item>
    <item>
      <title>Re: Comparing two dataframes</title>
      <link>https://community.databricks.com/t5/data-engineering/comparing-two-dataframes/m-p/29795#M21498</link>
      <description>&lt;P&gt;@Siddartha Paturu If that is the case, I would recommend joining the two dataframes and then comparing all the columns.&lt;/P&gt;</description>
      <pubDate>Mon, 04 Apr 2016 17:20:22 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/comparing-two-dataframes/m-p/29795#M21498</guid>
      <dc:creator>girivaratharaja</dc:creator>
      <dc:date>2016-04-04T17:20:22Z</dc:date>
    </item>
    <item>
      <title>Re: Comparing two dataframes</title>
      <link>https://community.databricks.com/t5/data-engineering/comparing-two-dataframes/m-p/29796#M21499</link>
      <description>&lt;P&gt;How can we compare the columns?&lt;/P&gt;</description>
      <pubDate>Tue, 05 Apr 2016 15:36:14 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/comparing-two-dataframes/m-p/29796#M21499</guid>
      <dc:creator>SiddarthaPaturu</dc:creator>
      <dc:date>2016-04-05T15:36:14Z</dc:date>
    </item>
    <item>
      <title>Re: Comparing two dataframes</title>
      <link>https://community.databricks.com/t5/data-engineering/comparing-two-dataframes/m-p/29797#M21500</link>
      <description>&lt;P&gt;I have recently been stuck in the same situation. Can somebody help me with how to compare columns in this scenario? @Siddartha Paturu, please help me out if you have already found a solution. Thanks in advance.&lt;/P&gt;</description>
      <pubDate>Wed, 20 Jul 2016 15:21:40 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/comparing-two-dataframes/m-p/29797#M21500</guid>
      <dc:creator>jagannathsahoo</dc:creator>
      <dc:date>2016-07-20T15:21:40Z</dc:date>
    </item>
    <item>
      <title>Re: Comparing two dataframes</title>
      <link>https://community.databricks.com/t5/data-engineering/comparing-two-dataframes/m-p/29798#M21501</link>
      <description>&lt;P&gt;I am stuck on the same issue. Are there any updates on this?&lt;/P&gt;
&lt;P&gt;Is there a solution to this problem?&lt;/P&gt;</description>
      <pubDate>Tue, 20 Sep 2016 22:29:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/comparing-two-dataframes/m-p/29798#M21501</guid>
      <dc:creator>ShashishekharDe</dc:creator>
      <dc:date>2016-09-20T22:29:42Z</dc:date>
    </item>
    <item>
      <title>Re: Comparing two dataframes</title>
      <link>https://community.databricks.com/t5/data-engineering/comparing-two-dataframes/m-p/29799#M21502</link>
      <description>&lt;P&gt;Try using the &lt;CODE&gt;all.equal&lt;/CODE&gt; function (in R).&lt;/P&gt;
&lt;P&gt;It does not sort the data frames; it checks each cell in one data frame against the same cell in the other. You can also use the &lt;CODE&gt;identical()&lt;/CODE&gt; function.&lt;/P&gt;
&lt;P&gt;I would like to share a link that may help solve your problem: &lt;A href="https://goo.gl/pgLaEd" target="test_blank"&gt;https://goo.gl/pgLaEd&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Sat, 24 Sep 2016 08:27:24 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/comparing-two-dataframes/m-p/29799#M21502</guid>
      <dc:creator>amandaphy</dc:creator>
      <dc:date>2016-09-24T08:27:24Z</dc:date>
    </item>
    <item>
      <title>Re: Comparing two dataframes</title>
      <link>https://community.databricks.com/t5/data-engineering/comparing-two-dataframes/m-p/29800#M21503</link>
      <description>&lt;P&gt;I think the best bet in such a case is to take an inner join (equivalent to an intersection) with a condition on the columns that must have the same value in both dataframes. For example:&lt;/P&gt;
&lt;P&gt;Let df1 and df2 be two dataframes. If df1 has columns (A, B, C) and df2 has columns (D, C, B), you can create a new dataframe that is the intersection of df1 and df2 conditioned on columns B and C.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;df3 = df1.join(df2, [df1.B == df2.B, df1.C == df2.C], how='inner')&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;df3 will contain only those rows where the above condition is satisfied from df1 and df2.&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 28 Jun 2018 13:53:44 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/comparing-two-dataframes/m-p/29800#M21503</guid>
      <dc:creator>sbharti</dc:creator>
      <dc:date>2018-06-28T13:53:44Z</dc:date>
    </item>
  </channel>
</rss>

