cancel
Showing results forย 
Search instead forย 
Did you mean:ย 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results forย 
Search instead forย 
Did you mean:ย 

Data comparison

Frustrated_DE
New Contributor III

Hi,

   Are there any tools within Databricks for large volume data comparisons, I appreciate there's methods for dataframe comparisons for unit testing (assertDataFrameEqual) but it is my understanding these are for testing transformations on smallish data. I have sizeable datasets that I would like to compare to ensure the values are equal before starting another pipeline and hoping to find an efficient way of undertaking this exercise. Any thoughts appreciated.

                    Thanks

4 REPLIES 4

szymon_dybczak
Contributor III

Hi @Frustrated_DE ,

I don't know if that's what you're looking for, but maybe you can use set operators to compare dataframes ( intersect, except). If both dataframes have the same schema, except operator gives you difference between two sets of data. Intersect will return data that is common in both datasets.

Set operators are pretty handy when it comes to data quality validation.

Frustrated_DE
New Contributor III

Thanks Szymon, I will give these a try!

cgrant
Databricks Employee
Databricks Employee

Borrowed from LinkedIn, here is a SQL query you can use to compare two tables (or dataframes)

with
hash_src as ( select hash(*) as hash_val from my.source.table ),
hash_tgt as ( select hash(*) as hash_val from my.target.table )

select sum(hash_val) ^ AVG(hash_val)::int ^ MIN(hash_val) ^ MAX(hash_val) as hash_val from hash_src
union
select sum(hash_val) ^ AVG(hash_val)::int ^ MIN(hash_val) ^ MAX(hash_val) as hash_val from hash_tgt

If you get one row back ... the tables are the same.
If you get two rows back ... they're different.

Thanks @cgrant for sharing! Quite clever trick:)

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you wonโ€™t want to miss the chance to attend and share knowledge.

If there isnโ€™t a group near you, start one and help create a community that brings people together.

Request a New Group