<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Data comparison in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/data-comparison/m-p/100115#M40190</link>
    <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/99096"&gt;@Frustrated_DE&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;I don't know if that's what you're looking for, but maybe you can use set operators (intersect, except) to compare dataframes. If both dataframes have the same schema, the except operator returns the rows that are in the first dataframe but not in the second, so running it in both directions shows the differences on either side. Intersect returns the rows that are common to both datasets.&lt;/P&gt;&lt;P&gt;Set operators are pretty handy when it comes to data quality validation.&lt;/P&gt;</description>
    <pubDate>Tue, 26 Nov 2024 14:59:46 GMT</pubDate>
    <dc:creator>szymon_dybczak</dc:creator>
    <dc:date>2024-11-26T14:59:46Z</dc:date>
    <item>
      <title>Data comparison</title>
      <link>https://community.databricks.com/t5/data-engineering/data-comparison/m-p/100103#M40185</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;Are there any tools within Databricks for large-volume data comparisons? I appreciate there are methods for dataframe comparison in unit testing (assertDataFrameEqual), but my understanding is that these are meant for testing transformations on smallish data. I have sizeable datasets that I would like to compare to ensure the values are equal before starting another pipeline, and I'm hoping to find an efficient way of doing this. Any thoughts appreciated.&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;</description>
      <pubDate>Tue, 26 Nov 2024 14:22:09 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/data-comparison/m-p/100103#M40185</guid>
      <dc:creator>Frustrated_DE</dc:creator>
      <dc:date>2024-11-26T14:22:09Z</dc:date>
    </item>
    <item>
      <title>Re: Data comparison</title>
      <link>https://community.databricks.com/t5/data-engineering/data-comparison/m-p/100115#M40190</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/99096"&gt;@Frustrated_DE&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;I don't know if that's what you're looking for, but maybe you can use set operators (intersect, except) to compare dataframes. If both dataframes have the same schema, the except operator returns the rows that are in the first dataframe but not in the second, so running it in both directions shows the differences on either side. Intersect returns the rows that are common to both datasets.&lt;/P&gt;&lt;P&gt;Set operators are pretty handy when it comes to data quality validation.&lt;/P&gt;</description>
      <pubDate>Tue, 26 Nov 2024 14:59:46 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/data-comparison/m-p/100115#M40190</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2024-11-26T14:59:46Z</dc:date>
    </item>
    <item>
      <title>Re: Data comparison</title>
      <link>https://community.databricks.com/t5/data-engineering/data-comparison/m-p/100120#M40192</link>
      <description>&lt;P&gt;Thanks Szymon, I will give these a try!&lt;/P&gt;</description>
      <pubDate>Tue, 26 Nov 2024 15:15:59 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/data-comparison/m-p/100120#M40192</guid>
      <dc:creator>Frustrated_DE</dc:creator>
      <dc:date>2024-11-26T15:15:59Z</dc:date>
    </item>
    <item>
      <title>Re: Data comparison</title>
      <link>https://community.databricks.com/t5/data-engineering/data-comparison/m-p/100160#M40206</link>
      <description>&lt;P&gt;Borrowed from LinkedIn, here is a SQL query you can use to compare two tables (or dataframes):&lt;/P&gt;
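&lt;LI-CODE lang="markup"&gt;-- Hash every row of each table (hash(*) covers all columns, in column order)
with
hash_src as ( select hash(*) as hash_val from my.source.table ),
hash_tgt as ( select hash(*) as hash_val from my.target.table )

-- Fold each table's row hashes into a single fingerprint value;
-- union removes duplicates, so identical fingerprints collapse into one row
select sum(hash_val) ^ avg(hash_val)::int ^ min(hash_val) ^ max(hash_val) as hash_val from hash_src
union
select sum(hash_val) ^ avg(hash_val)::int ^ min(hash_val) ^ max(hash_val) as hash_val from hash_tgt&lt;/LI-CODE&gt;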
&lt;P&gt;If you get one row back, the tables almost certainly match (hash collisions are possible, so treat this as a strong check rather than a proof).&lt;BR /&gt;If you get two rows back, they're different.&lt;/P&gt;</description>
      <pubDate>Tue, 26 Nov 2024 20:49:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/data-comparison/m-p/100160#M40206</guid>
      <dc:creator>cgrant</dc:creator>
      <dc:date>2024-11-26T20:49:45Z</dc:date>
    </item>
    <item>
      <title>Re: Data comparison</title>
      <link>https://community.databricks.com/t5/data-engineering/data-comparison/m-p/100272#M40251</link>
      <description>&lt;P&gt;Thanks &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/33816"&gt;@cgrant&lt;/a&gt; for sharing! Quite a clever trick :)&lt;/P&gt;</description>
      <pubDate>Wed, 27 Nov 2024 18:15:54 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/data-comparison/m-p/100272#M40251</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2024-11-27T18:15:54Z</dc:date>
    </item>
  </channel>
</rss>

