
I want to compare two data frames. In the output, I wish to see the unmatched rows and the columns identified as leading to the differences.

Databricks_POC
New Contributor II
 
1 ACCEPTED SOLUTION


Pholo
Contributor

Hi @vinita shinde

I don't know if it meets your requirement, but you can try something like this:

import pyspark.sql.functions as F

def is_different(col1, col2, cols):
  # If the "b_"-prefixed counterpart exists, compare the two values
  # (true = equal, false = different); otherwise keep the column as-is.
  if col2 in cols:
    return (F.col(col1) == F.col(col2)).alias(col1)
  else:
    return F.col(col1).alias(col1)

def find_difference(data1, data2, id_columns):
  cols = data1.columns
  # First, find the rows of data1 that are missing from data2
  data_difference = data1.subtract(data2)
  # Join those rows back to data2, prefixing its non-ID columns with "b_"
  data_difference = data_difference.join(
      data2.select(*[F.col(i) if i in id_columns else F.col(i).alias(f'b_{i}') for i in cols]),
      on=id_columns, how='left')
  # Then flag, column by column, which values caused the differences
  return data_difference.select(*[is_different(x, f'b_{x}', data_difference.columns) for x in cols])

# df1_spark: the first dataframe
# df2_spark: the second dataframe
# the third argument is the list of key (ID) columns
display(find_difference(df1_spark, df2_spark, ['ID']))
  

The results will be something like this: each ID column keeps its value, while every other column shows true where the two DataFrames agree and false where they differ.

(result screenshot omitted)

Let me know if it meets your requirements.


6 REPLIES

Kaniz
Community Manager

Hi @Databricks POC! My name is Kaniz, and I'm the technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers in the community have an answer first; otherwise I will get back to you soon. Thanks.

Anonymous
Not applicable

Are you trying to do change data capture between an older and newer version of the same table?

Databricks_POC
New Contributor II

No, I wish to compare two tables, the same scenario as a Minus/Except query. However, apart from the mismatched rows, I also wish to know which columns lead to the difference.


How to print only what columns changed instead of displaying all columns?

For every row, I just want to print the primary key (ID) and, as a comma-separated list, only the names of the columns that changed in the second data frame.

bhargavi1
New Contributor II

@vinita shinde, have you cracked this code?
