Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

I want to compare two data frames. In the output I wish to see the unmatched rows and the columns that cause the differences.

Databricks_POC
New Contributor II
 
1 ACCEPTED SOLUTION

Accepted Solutions

Pholo
Contributor

Hi @vinita shinde

I don't know if it meets your requirement, but you can try something like this:

import pyspark.sql.functions as F

def is_different(col1, col2, cols):
  # Non-key columns have a "b_" twin from the join: compare the pair.
  # Key columns have no twin: keep their value as-is.
  if col2 in cols:
    return (F.col(col1) == F.col(col2)).alias(col1)  # True = values match
  else:
    return F.col(col1).alias(col1)

def find_difference(data1, data2, id_columns):
  cols = data1.columns
  # First, find the rows of data1 that do not appear in data2
  data_difference = data1.subtract(data2)
  # Then join those rows back to data2, prefixing its non-key columns with "b_"
  data_difference = data_difference.join(
      data2.select(*[F.col(i).alias(f'b_{i}') if i not in id_columns else F.col(i)
                     for i in cols]),
      on=id_columns, how='left')
  # Finally, flag per column whether the two sides agree
  return data_difference.select(*[is_different(x, f'b_{x}', data_difference.columns)
                                  for x in cols])

# df1_spark: first dataframe
# df2_spark: second dataframe
# the third argument is the list of key columns identifying a row
display(find_difference(df1_spark, df2_spark, ['ID']))
  

The result will look something like this:

[screenshot: each non-key column shows true where the values match and false where they differ]

Let me know if it meets your requirements.
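The logic of this approach can also be illustrated outside Spark. Below is a minimal plain-Python sketch (not the Spark code above): each "table" is a list of dicts, `id_columns` names the key columns, and the sample data is hypothetical.

```python
def find_difference(data1, data2, id_columns):
    # Rows of data1 with no exact match in data2 (like DataFrame.subtract)
    diff_rows = [row for row in data1 if row not in data2]
    # Index data2 by its key columns (like the left join on id_columns)
    index = {tuple(r[c] for c in id_columns): r for r in data2}
    result = []
    for row in diff_rows:
        other = index.get(tuple(row[c] for c in id_columns), {})
        record = {}
        for col, val in row.items():
            if col in id_columns:
                record[col] = val                    # keep the key as-is
            else:
                record[col] = val == other.get(col)  # True = match, False = mismatch
        result.append(record)
    return result

left  = [{"ID": 1, "x": 10, "y": 20}, {"ID": 2, "x": 30, "y": 40}]
right = [{"ID": 1, "x": 10, "y": 99}, {"ID": 2, "x": 30, "y": 40}]
print(find_difference(left, right, ["ID"]))
# [{'ID': 1, 'x': True, 'y': False}] -- only row 1 differs, and only in column y
```

Rows that match completely are dropped up front, so the output contains one record per mismatched row, with a boolean per non-key column.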


5 REPLIES

Anonymous
Not applicable

Are you trying to do change data capture between an older and newer version of the same table?

Databricks_POC
New Contributor II

No, I wish to compare two tables, the same scenario as a Minus/Except query. However, apart from the mismatched rows, I also wish to know which columns cause the difference.


How do I print only the columns that changed, instead of displaying all columns?

For every row I just want to print the primary key (ID) and the names of the columns that changed in the second data frame, as a comma-separated list.
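A hedged sketch of that report in plain Python (not Spark code), assuming each table is a list of dicts and `id_columns` names the key columns: for every row whose non-key values differ, emit the key plus a comma-separated list of the changed column names.

```python
def changed_columns(data1, data2, id_columns):
    # Index data2 by its key columns so each row of data1 can find its twin
    index = {tuple(r[c] for c in id_columns): r for r in data2}
    report = []
    for row in data1:
        key = tuple(row[c] for c in id_columns)
        other = index.get(key, {})
        # Names of the non-key columns whose values differ between the two rows
        changed = [c for c in row
                   if c not in id_columns and row[c] != other.get(c)]
        if changed:  # skip rows that match completely
            report.append((key, ",".join(changed)))
    return report

left  = [{"ID": 1, "x": 10, "y": 20}, {"ID": 2, "x": 30, "y": 40}]
right = [{"ID": 1, "x": 11, "y": 99}, {"ID": 2, "x": 30, "y": 40}]
print(changed_columns(left, right, ["ID"]))
# [((1,), 'x,y')] -- row 1 changed in columns x and y
```

In Spark itself, the same idea can be expressed on top of the join from the accepted answer: build a per-column `F.when(F.col(c) != F.col(f'b_{c}'), F.lit(c))` expression for each non-key column and combine them with `F.concat_ws(',', ...)`.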

bhargavi1
New Contributor II

@vinita shinde did you manage to crack this?
