
I want to compare two data frames. In the output, I wish to see the unmatched rows and the columns identified as leading to the differences.

Databricks_POC
New Contributor II
 
1 ACCEPTED SOLUTION


Pholo
Contributor

Hi @vinita shinde

I don't know if it meets your requirement, but you can try something like this:

import pyspark.sql.functions as F

def is_different(col1, col2, cols):
  # If the "b_"-prefixed counterpart exists, compare the two values
  # (true = equal, false = different); otherwise keep the column as-is.
  if col2 in cols:
    return (F.col(col1) == F.col(col2)).alias(col1)
  else:
    return F.col(col1).alias(col1)

def find_difference(data1, data2, id_columns):
  cols = data1.columns
  # First, find the rows of data1 that are missing from data2
  data_difference = data1.subtract(data2)
  # Join those rows back to data2, prefixing its non-ID columns with "b_"
  data_difference = data_difference.join(
      data2.select(*[F.col(i) if i in id_columns else F.col(i).alias(f'b_{i}') for i in cols]),
      on=id_columns, how='left')
  # Then flag, column by column, which values caused the differences
  return data_difference.select(*[is_different(x, f'b_{x}', data_difference.columns) for x in cols])

# df1_spark: the first dataframe
# df2_spark: the second dataframe
# the third argument is the list of key (ID) columns
display(find_difference(df1_spark, df2_spark, ['ID']))
  

The results will be something like this: each ID column keeps its value, while every other column shows true where the two DataFrames agree and false where they differ.

(result screenshot omitted)

Let me know if it meets your requirements.


6 REPLIES

Kaniz
Community Manager

Hi @Databricks POC! My name is Kaniz, and I'm the technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers in the community have an answer first; otherwise I will get back to you soon. Thanks.

Anonymous
Not applicable

Are you trying to do change data capture between an older and newer version of the same table?

Databricks_POC
New Contributor II

No, I wish to compare two tables, the same scenario as a Minus/Except query. However, apart from the mismatched rows, I also wish to know which columns lead to the difference.


How to print only what columns changed instead of displaying all columns?

For every row, I just want to print the primary key (ID) and, as a comma-separated list, only the names of the columns that changed in the second data frame.

bhargavi1
New Contributor II

@vinita shinde, have you cracked this code?
