topic Re: Data comparison in Data Engineering

Data comparison

Frustrated_DE — Tue, 26 Nov 2024 14:22:09 GMT

Hi,

Are there any tools within Databricks for large-volume data comparisons? I appreciate there are methods for DataFrame comparisons in unit testing (assertDataFrameEqual), but it is my understanding these are meant for testing transformations on smallish data. I have sizeable datasets that I would like to compare to ensure the values are equal before starting another pipeline, and I am hoping to find an efficient way of doing this. Any thoughts appreciated.

Thanks

Re: Data comparison

szymon_dybczak — Tue, 26 Nov 2024 14:59:46 GMT

Hi @Frustrated_DE ,

I don't know if that's what you're looking for, but maybe you can use set operators (intersect, except) to compare DataFrames. If both DataFrames have the same schema, the except operator returns the rows in the first dataset that are missing from the second, so running it in both directions gives you the full difference. Intersect returns the rows that are common to both datasets.

Set operators are pretty handy when it comes to data quality validation.
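
To make the set-operator idea concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module purely as a stand-in so that it runs anywhere; the table names (src, tgt), the sample rows and the helper symmetric_diff are made up for illustration and are not from this thread. On Databricks the same EXCEPT / INTERSECT statements would presumably be run against your own tables instead.

```python
import sqlite3


def symmetric_diff(conn, source, target):
    """Return (rows only in source, rows only in target).

    EXCEPT is directional, so it is run once in each direction.
    Table names are interpolated directly; only pass trusted names.
    """
    only_src = conn.execute(
        f"SELECT * FROM {source} EXCEPT SELECT * FROM {target}"
    ).fetchall()
    only_tgt = conn.execute(
        f"SELECT * FROM {target} EXCEPT SELECT * FROM {source}"
    ).fetchall()
    return only_src, only_tgt


# Hypothetical sample data: row 2 differs between the two tables.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE src (id INTEGER, amount REAL);
    CREATE TABLE tgt (id INTEGER, amount REAL);
    INSERT INTO src VALUES (1, 10.0), (2, 20.0), (3, 30.0);
    INSERT INTO tgt VALUES (1, 10.0), (2, 25.0), (3, 30.0);
    """
)

only_src, only_tgt = symmetric_diff(conn, "src", "tgt")

# Rows present in both tables.
common = conn.execute(
    "SELECT * FROM src INTERSECT SELECT * FROM tgt"
).fetchall()

print("only in src:", only_src)
print("only in tgt:", only_tgt)
print("in both:", sorted(common))
```

Both result lists being empty means the two tables hold the same set of distinct rows. Plain EXCEPT ignores duplicates, so if row counts matter you would also need to compare counts or use EXCEPT ALL, which Spark SQL supports but SQLite does not.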

Re: Data comparison

Frustrated_DE — Tue, 26 Nov 2024 15:15:59 GMT

Thanks Szymon, I will give these a try!

Re: Data comparison

cgrant — Tue, 26 Nov 2024 20:49:45 GMT

Borrowed from LinkedIn, here is a SQL query you can use to compare two tables (or DataFrames):

with hash_src as (
  select hash(*) as hash_val from my.source.table
),
hash_tgt as (
  select hash(*) as hash_val from my.target.table
)
select sum(hash_val) ^ AVG(hash_val)::int ^ MIN(hash_val) ^ MAX(hash_val) as hash_val
from hash_src
union
select sum(hash_val) ^ AVG(hash_val)::int ^ MIN(hash_val) ^ MAX(hash_val) as hash_val
from hash_tgt

If you get one row back, the two fingerprints match and the tables are almost certainly the same (a hash collision could in principle hide a difference).
If you get two rows back, they're different.
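
For anyone wondering why this works: each row is reduced to a hash, and the hashes are combined with aggregates (sum, avg, min, max) that do not depend on row order, so each table collapses to a single fingerprint. The sketch below mimics that logic in plain Python. It is only an illustration under assumptions: zlib.crc32 stands in for the SQL hash function, and the names row_hash and table_fingerprint and the sample rows are made up, so the numbers will not match what the query returns.

```python
import zlib


def row_hash(row):
    """Deterministic per-row hash (crc32 is a stand-in for SQL hash(*))."""
    return zlib.crc32(repr(row).encode("utf-8"))


def table_fingerprint(rows):
    """Combine row hashes with order-independent aggregates.

    Mirrors sum ^ avg::int ^ min ^ max from the query above.
    Returns None for an empty table, like SQL aggregates returning NULL.
    """
    hashes = [row_hash(r) for r in rows]
    if not hashes:
        return None
    total = sum(hashes)
    average = total // len(hashes)  # integer stand-in for avg(...)::int
    return total ^ average ^ min(hashes) ^ max(hashes)


# Hypothetical sample tables.
source = [(1, "a", 10.0), (2, "b", 20.0), (3, "c", 30.0)]
reordered = list(reversed(source))                          # same rows, new order
changed = [(1, "a", 10.0), (2, "b", 25.0), (3, "c", 30.0)]  # one value differs

same_when_reordered = table_fingerprint(source) == table_fingerprint(reordered)
same_when_changed = table_fingerprint(source) == table_fingerprint(changed)

print("reordered copy matches:", same_when_reordered)
print("changed copy matches:", same_when_changed)
```

The fingerprint only tells you whether the tables differ, not where. If it reports a mismatch, a row-level check such as except in both directions can locate the offending rows.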

Re: Data comparison

szymon_dybczak — Wed, 27 Nov 2024 18:15:54 GMT

Thanks @cgrant for sharing! Quite a clever trick :)