01-11-2022 10:00 AM
Hi @vinita shinde
I don't know if it meets your requirement, but you can try something like this:
import pyspark.sql.functions as F

def is_different(col1, col2, cols):
    # If the "b_"-prefixed counterpart exists, compare the two values;
    # the resulting boolean is True where both dataframes agree
    if col2 in cols:
        return F.col(col1) == F.col(col2)
    # Otherwise (the key columns), pass the column through unchanged
    return F.col(col1).alias(col1)

def find_difference(data1, data2, id_columns):
    cols = data1.columns
    # First you find the rows of data1 that are not present in data2
    data_difference = data1.subtract(data2)
    # Then you join the differing rows with the second dataframe,
    # prefixing its non-key columns with "b_"
    data_difference = data_difference.join(
        data2.select(*[F.col(i).alias(f'b_{i}') if i not in id_columns else F.col(i) for i in cols]),
        on=id_columns,
        how='left'
    )
    # Finally you find out which columns cause the differences
    return data_difference.select(*[is_different(x, f'b_{x}', data_difference.columns) for x in cols])

# df1_spark: first dataframe
# df2_spark: second dataframe
# the third argument is the list of key columns identifying a row
display(find_difference(df1_spark, df2_spark, ['ID']))
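One caveat worth knowing: a plain == comparison returns null when either side is null, so a value-vs-null difference shows up as null rather than False. If you want nulls handled explicitly, you could swap in a null-safe variant, something like this sketch:

def is_different_nullsafe(col1, col2, cols):
    # eqNullSafe treats two nulls as equal and null vs. value as unequal,
    # so differences involving nulls come out as False instead of null
    if col2 in cols:
        return F.col(col1).eqNullSafe(F.col(col2))
    return F.col(col1).alias(col1)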
The result will be one row per mismatched row, with True where a column has the same value in both dataframes and False (or null) where it differs.
Let me know if it meets your requirements.
12-20-2021 11:26 AM
Are you trying to do change data capture between an older and newer version of the same table?
12-21-2021 12:59 AM
No, I wish to compare two tables, the same scenario as a MINUS/EXCEPT query. However, apart from the mismatched rows, I also wish to know which columns cause the difference.
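In PySpark terms, the row-level part would be something like:

df_mismatches = df1.subtract(df2)  # rows of df1 not present in df2 (MINUS/EXCEPT semantics)

but that alone doesn't tell me which columns caused the mismatch.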
12-26-2023 10:32 AM
How do I print only the columns that changed, instead of displaying all columns?
For every row, I just want to print the primary key (ID) and the names of the columns that changed in the second dataframe, comma-separated.
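One way to sketch that, building on the find_difference approach above (the function name changed_column_names and the output column changed_columns are my own choices, not from the original post): compare each non-key column null-safely, emit the column name when the values differ, and let concat_ws drop the nulls.

import pyspark.sql.functions as F

def changed_column_names(data1, data2, id_columns):
    value_cols = [c for c in data1.columns if c not in id_columns]
    # Rows of data1 that differ, joined back to data2 with "b_"-prefixed columns
    diff = data1.subtract(data2).join(
        data2.select(*[F.col(c).alias(f'b_{c}') if c not in id_columns else F.col(c)
                       for c in data2.columns]),
        on=id_columns,
        how='left'
    )
    # F.when(...) without otherwise() yields null for matching columns,
    # and concat_ws silently skips nulls, leaving only the changed names
    changed = F.concat_ws(
        ',',
        *[F.when(~F.col(c).eqNullSafe(F.col(f'b_{c}')), F.lit(c)) for c in value_cols]
    )
    return diff.select(*id_columns, changed.alias('changed_columns'))

display(changed_column_names(df1_spark, df2_spark, ['ID']))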
04-28-2022 01:53 AM
@vinita shinde have you cracked this code?