<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic How can we compare two dataframes in spark scala to find difference between these 2 files, which column ?? and value ??. in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-can-we-compare-two-dataframes-in-spark-scala-to-find/m-p/28341#M20161</link>
    <description></description>
    <pubDate>Sun, 20 Jan 2019 07:00:31 GMT</pubDate>
    <dc:creator>shampa</dc:creator>
    <dc:date>2019-01-20T07:00:31Z</dc:date>
    <item>
      <title>How can we compare two dataframes in spark scala to find difference between these 2 files, which column ?? and value ??.</title>
      <link>https://community.databricks.com/t5/data-engineering/how-can-we-compare-two-dataframes-in-spark-scala-to-find/m-p/28341#M20161</link>
      <description>&lt;P&gt;I have two files, and I created two DataFrames, prod1 and prod2, from them. I need to find the records whose values do not match between the two DataFrames, together with the names of the mismatched columns and both values.&lt;/P&gt;
&lt;P&gt;id_sk is the primary key. All columns are of string type.&lt;/P&gt;
&lt;P&gt;DataFrame 1 (prod1):&lt;/P&gt;
&lt;P&gt;id_sk|uuid|name&lt;/P&gt;
&lt;P&gt;1|10|a&lt;/P&gt;
&lt;P&gt;2|20|b&lt;/P&gt;
&lt;P&gt;3|30|c&lt;/P&gt;
&lt;P&gt;DataFrame 2 (prod2):&lt;/P&gt;
&lt;P&gt;id_sk|uuid|name&lt;/P&gt;
&lt;P&gt;2|20|b-upd&lt;/P&gt;
&lt;P&gt;3|30-up|c&lt;/P&gt;
&lt;P&gt;4|40|d&lt;/P&gt;
&lt;P&gt;I need the result DataFrame in the following format:&lt;/P&gt;
&lt;P&gt;id|col_name|values&lt;/P&gt;
&lt;P&gt;2|name|b,b-upd&lt;/P&gt;
&lt;P&gt;3|uuid|30,30-up&lt;/P&gt;
&lt;P&gt;I did an inner join and compared the records that do not match.&lt;/P&gt;
&lt;P&gt;This is the result I am getting:&lt;/P&gt;
&lt;P&gt;id_sk|uuid_prod1|uuid_prod2|name_prod1|name_prod2&lt;/P&gt;
&lt;P&gt;2|20|20|b|b-upd&lt;/P&gt;
&lt;P&gt;3|30|30-up|c|c&lt;/P&gt;
&lt;P&gt;val common_rec = prod1.join(prod2, prod1("id_sk") === prod2("id_sk"), "inner").select(prod1("id_sk").alias("id_sk_prod1"), prod1("uuid").alias("uuid_prod1"), prod2("uuid").alias("uuid_prod2"), prod1("name").alias("name_prod1"), prod2("name").alias("name_prod2"))&lt;/P&gt;
&lt;P&gt;common_rec.createOrReplaceTempView("common_rec") // register the joined result so spark.sql can query it by name&lt;/P&gt;
&lt;P&gt;val compare = spark.sql("select ... from common_rec where col_prod1 &amp;lt;&amp;gt; col_prod2")&lt;/P&gt;</description>
      <pubDate>Sun, 20 Jan 2019 07:00:31 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-can-we-compare-two-dataframes-in-spark-scala-to-find/m-p/28341#M20161</guid>
      <dc:creator>shampa</dc:creator>
      <dc:date>2019-01-20T07:00:31Z</dc:date>
    </item>
    <item>
      <title>Re: How can we compare two dataframes in spark scala to find difference between these 2 files, which column ?? and value ??.</title>
      <link>https://community.databricks.com/t5/data-engineering/how-can-we-compare-two-dataframes-in-spark-scala-to-find/m-p/28342#M20162</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Use a full outer join in Spark SQL.&lt;/P&gt;
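&lt;P&gt;A minimal sketch of how the full outer join suggestion might be applied to the question's prod1 and prod2, assuming Spark 2.x or later on the classpath and that both DataFrames share the same string columns. The names CompareFrames and diff are hypothetical, chosen for illustration; they are not part of Spark. The idea is to join on the key, build one (col_name, values) struct per non-key column where the two sides differ, and explode those structs into rows.&lt;/P&gt;

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, coalesce, col, concat_ws, explode, lit, not, struct, when}

// Hypothetical helper object; the name and signature are assumptions for this sketch.
object CompareFrames {

  // Returns one row (id, col_name, values) per key and column whose value differs.
  // Assumes both frames have the same columns and that all of them are strings.
  def diff(left: DataFrame, right: DataFrame, key: String): DataFrame = {
    val valueCols = left.columns.filter(_ != key)

    // Full outer join keeps keys that exist on only one side.
    val joined = left.alias("l")
      .join(right.alias("r"), col(s"l.$key") === col(s"r.$key"), "full_outer")

    // One struct per compared column; the when(...) yields null if both sides agree.
    // eqNullSafe treats two nulls as equal and null vs. value as different.
    val perColumn = valueCols.map { c =>
      when(
        not(col(s"l.$c").eqNullSafe(col(s"r.$c"))),
        struct(
          lit(c).as("col_name"),
          concat_ws(",",
            coalesce(col(s"l.$c"), lit("null")),
            coalesce(col(s"r.$c"), lit("null"))).as("values")
        )
      )
    }

    joined
      .select(
        coalesce(col(s"l.$key"), col(s"r.$key")).as("id"),
        explode(array(perColumn: _*)).as("d"))
      .where(col("d").isNotNull) // drop the columns that matched
      .select(col("id"), col("d.col_name"), col("d.values"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("compare-frames")
      .getOrCreate()
    import spark.implicits._

    // Sample data copied from the question.
    val prod1 = Seq(("1", "10", "a"), ("2", "20", "b"), ("3", "30", "c"))
      .toDF("id_sk", "uuid", "name")
    val prod2 = Seq(("2", "20", "b-upd"), ("3", "30-up", "c"), ("4", "40", "d"))
      .toDF("id_sk", "uuid", "name")

    diff(prod1, prod2, "id_sk").orderBy("id", "col_name").show(false)
    spark.stop()
  }
}
```

&lt;P&gt;With the sample data from the question, ids 2 and 3 should produce the rows 2|name|b,b-upd and 3|uuid|30,30-up. Because the join is a full outer join, ids 1 and 4, which exist in only one DataFrame, are also reported, with the literal text null standing in for the missing side. Switching the join type to "inner" would restrict the output to keys present in both DataFrames.&lt;/P&gt;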
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 05 Feb 2019 06:14:48 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-can-we-compare-two-dataframes-in-spark-scala-to-find/m-p/28342#M20162</guid>
      <dc:creator>manojlukhi</dc:creator>
      <dc:date>2019-02-05T06:14:48Z</dc:date>
    </item>
  </channel>
</rss>

