12-20-2021 01:14 AM
01-11-2022 10:00 AM
Hi @vinita shinde
I don't know if it meets your requirements, but you can try something like this:
import pyspark.sql.functions as F

def is_different(col1, col2, cols):
    # If a "b_"-prefixed counterpart exists, compare the two values;
    # otherwise (e.g. for the ID columns) just pass the column through
    if col2 in cols:
        return F.col(col1) == F.col(col2)
    else:
        return F.col(col1).alias(col1)

def find_difference(data1, data2, id_columns):
    cols = data1.columns
    # First you find the difference in rows
    data_difference = data1.subtract(data2)
    # Then you join the differences with the second dataframe,
    # prefixing its non-ID columns with "b_"
    data_difference = data_difference.join(
        data2.select(*[F.col(i).alias(f"b_{i}") if i not in id_columns else F.col(i) for i in cols]),
        on=id_columns,
        how="left",
    )
    # Finally you find out which columns cause the differences
    return data_difference.select(*[is_different(x, f"b_{x}", data_difference.columns) for x in cols])

# df1_spark: first dataframe
# df2_spark: second dataframe
# third argument: list of ID columns identifying a row
display(find_difference(df1_spark, df2_spark, ['ID']))
The results will be something like this.
Let me know if it meets your requirements.
12-20-2021 11:26 AM
Are you trying to do change data capture between an older and newer version of the same table?
12-21-2021 12:59 AM
No, I wish to compare two tables. It's the same scenario as a MINUS/EXCEPT query, but apart from the mismatched rows, I also wish to know which columns are causing the difference.
12-26-2023 10:32 AM
How do I print only the columns that changed, instead of displaying all columns?
For every row I just want to print the primary key (ID) and, as a comma-separated list, only the names of the columns that changed in the second data frame.
04-28-2022 01:53 AM
@vinita shinde did you crack this code?