Data comparison
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
11-26-2024 06:20 AM - edited 11-26-2024 06:22 AM
Hi,
Are there any tools within Databricks for large volume data comparisons, I appreciate there's methods for dataframe comparisons for unit testing (assertDataFrameEqual) but it is my understanding these are for testing transformations on smallish data. I have sizeable datasets that I would like to compare to ensure the values are equal before starting another pipeline and hoping to find an efficient way of undertaking this exercise. Any thoughts appreciated.
Thanks
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
11-26-2024 06:58 AM - edited 11-26-2024 06:59 AM
Hi @Frustrated_DE ,
I don't know if that's what you're looking for, but maybe you can use set operators to compare dataframes ( intersect, except). If both dataframes have the same schema, except operator gives you difference between two sets of data. Intersect will return data that is common in both datasets.
Set operators are pretty handy when it comes to data quality validation.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
11-26-2024 07:15 AM
Thanks Szymon, I will give these a try!
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
11-26-2024 12:49 PM
Borrowed from LinkedIn, here is a SQL query you can use to compare two tables (or dataframes)
with
hash_src as ( select hash(*) as hash_val from my.source.table ),
hash_tgt as ( select hash(*) as hash_val from my.target.table )
select sum(hash_val) ^ AVG(hash_val)::int ^ MIN(hash_val) ^ MAX(hash_val) as hash_val from hash_src
union
select sum(hash_val) ^ AVG(hash_val)::int ^ MIN(hash_val) ^ MAX(hash_val) as hash_val from hash_tgt
If you get one row back ... the tables are the same.
If you get two rows back ... they're different.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
11-27-2024 10:15 AM
Thanks @cgrant for sharing! Quite clever trick:)
data:image/s3,"s3://crabby-images/cb5bb/cb5bb73aed1093bf2bbc88d029c5de02e8c5cfc3" alt=""
data:image/s3,"s3://crabby-images/cb5bb/cb5bb73aed1093bf2bbc88d029c5de02e8c5cfc3" alt=""