โ12-20-2021 01:14 AM
โ12-20-2021 02:55 AM
Hi @Databricks POC! My name is Kaniz, and I'm the technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers in the community have an answer to your question first. Or else I will get back to you soon. Thanks.
โ12-20-2021 11:26 AM
Are you trying to do change data capture between an older and newer version of the same table?
โ12-21-2021 12:59 AM
No, I wish to compare two tables. Same Scenario as in case of Minus/Except query. However apart from the mismatched rows, I wish to also know which are those columns leading to the difference.
โ01-11-2022 10:00 AM
Hi @vinita shinde
I don't know if it meets your requirement, but you can try something like this:
import pyspark.sql.functions as F

def is_different(col1, col2, cols):
    # Value columns: compare the original value with its "b_"-prefixed
    # counterpart from the second DataFrame (True means the values match).
    if col2 in cols:
        return (F.col(col1) == F.col(col2)).alias(col1)
    # ID columns have no "b_" counterpart; pass them through unchanged.
    return F.col(col1).alias(col1)

def find_difference(data1, data2, id_columns):
    cols = data1.columns
    # First you find the rows that differ
    data_difference = data1.subtract(data2)
    # Then you join the differences with the second dataframe,
    # prefixing its non-ID columns with "b_"
    data_difference = data_difference.join(
        data2.select(*[F.col(i).alias(f"b_{i}") if i not in id_columns else F.col(i)
                       for i in cols]),
        on=id_columns,
        how="left",
    )
    # Finally you find out which columns cause the differences
    return data_difference.select(
        *[is_different(x, f"b_{x}", data_difference.columns) for x in cols]
    )

# df1_spark: first DataFrame
# df2_spark: second DataFrame
# third argument: list of ID columns identifying a row
display(find_difference(df1_spark, df2_spark, ["ID"]))
The result will be one row per difference, with a True/False flag per column showing whether that column matches in the second DataFrame.
Let me know if it meets your requirements.
โ12-26-2023 10:32 AM
How can I print only the columns that changed, instead of displaying all columns?
For every row I just want to print the primary key (ID) and the names of the columns that changed in the second DataFrame, comma separated.
โ04-28-2022 01:53 AM
@vinita shinde, did you manage to crack this?