Data Engineering
How can we compare two DataFrames in Spark Scala to find the differences between two files: which columns and which values differ?

shampa
New Contributor

I have two files, and I created two DataFrames, prod1 and prod2, out of them. I need to find the records whose column values do not match between the two DataFrames, along with the column names.

id_sk is the primary key; all the columns are of string datatype.

dataframe 1 (prod1):

id_sk | uuid  | name
1     | 10    | a
2     | 20    | b
3     | 30    | c

dataframe 2 (prod2):

id_sk | uuid  | name
2     | 20    | b-upd
3     | 30-up | c
4     | 40    | d

So I need the result DataFrame in the format below:

id | col_name | values
2  | name     | b,b-upd
3  | uuid     | 30,30-up
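To pin down the comparison logic the question is asking for, here is a minimal plain-Scala sketch (no Spark), assuming each row is keyed by id_sk and represented as a Map of column name to value; the Maps and the hard-coded column list are illustrative assumptions, not the actual file loading:

```scala
// Rows keyed by id_sk, one Map per DataFrame (illustrative data from the question).
val prod1 = Map(
  "1" -> Map("uuid" -> "10", "name" -> "a"),
  "2" -> Map("uuid" -> "20", "name" -> "b"),
  "3" -> Map("uuid" -> "30", "name" -> "c"))
val prod2 = Map(
  "2" -> Map("uuid" -> "20", "name" -> "b-upd"),
  "3" -> Map("uuid" -> "30-up", "name" -> "c"),
  "4" -> Map("uuid" -> "40", "name" -> "d"))

// For ids present in both, emit (id, col_name, "old,new") for each mismatching column.
val diffs = for {
  id  <- (prod1.keySet intersect prod2.keySet).toSeq.sorted
  col <- Seq("uuid", "name")
  v1 = prod1(id)(col)
  v2 = prod2(id)(col)
  if v1 != v2
} yield (id, col, s"$v1,$v2")
// diffs: ("2","name","b,b-upd"), ("3","uuid","30,30-up")
```

The same shape (id, col_name, values) is what the Spark version needs to produce.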

I did an inner join and compared the mismatched records. I am getting the result as follows:

id_sk | uuid_prod1 | uuid_prod2 | name_prod1 | name_prod2
2     | 20         | 20         | b          | b-upd
3     | 30         | 30-up      | c          | c

val common_rec = prod1.join(prod2, prod1("id_sk") === prod2("id_sk"), "inner")
  .select(
    prod1("id_sk").alias("id_sk"),
    prod1("uuid").alias("uuid_prod1"),
    prod2("uuid").alias("uuid_prod2"),
    prod1("name").alias("name_prod1"),
    prod2("name").alias("name_prod2"))

// Register the temp view before querying it with spark.sql
common_rec.createOrReplaceTempView("common_rec")

val compare = spark.sql("select ...from common_rec where col_prod1<>col_prod2")


manojlukhi
New Contributor II

Use a full outer join in Spark SQL, so that rows whose id_sk exists in only one of the two DataFrames are kept as well (an inner join drops them).
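That suggestion could be sketched as follows (this is a sketch, not a tested solution: it assumes a running SparkSession, the prod1/prod2 DataFrames from the question, and a column list known up front). For each column it keeps the rows where the two sides differ, then unions the per-column results into the requested id | col_name | values shape:

```scala
import org.apache.spark.sql.functions._

// Full outer join on the primary key, keeping both sides addressable via aliases.
val joined = prod1.alias("p1")
  .join(prod2.alias("p2"), col("p1.id_sk") === col("p2.id_sk"), "full_outer")

// Columns to compare (assumption; could also be derived from prod1.columns minus the key).
val compareCols = Seq("uuid", "name")

val report = compareCols.map { c =>
    joined
      // Null-safe "not equal": rows present on only one side also count as differences.
      .filter(!(col(s"p1.$c") <=> col(s"p2.$c")))
      .select(
        coalesce(col("p1.id_sk"), col("p2.id_sk")).alias("id"),
        lit(c).alias("col_name"),
        concat_ws(",", col(s"p1.$c"), col(s"p2.$c")).alias("values"))
  }
  .reduce(_ union _)

report.show()
```

Note that concat_ws skips nulls, so for an id that exists in only one DataFrame (1 and 4 in the sample data) the values column would contain just the single side's value.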
