<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How to perform a cross-check for data in multiple columns in same table? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-to-perform-a-cross-check-for-data-in-multiple-columns-in/m-p/24469#M17012</link>
    <description>&lt;P&gt;Hi, I have marked an answer as best&lt;/P&gt;</description>
    <pubDate>Wed, 27 Apr 2022 22:47:51 GMT</pubDate>
    <dc:creator>818674</dc:creator>
    <dc:date>2022-04-27T22:47:51Z</dc:date>
    <item>
      <title>How to perform a cross-check for data in multiple columns in same table?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-perform-a-cross-check-for-data-in-multiple-columns-in/m-p/24457#M17000</link>
      <description>&lt;P&gt;I am trying to check whether a certain datapoint exists in multiple locations.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;This is what my table looks like:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="Table"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1981iEFE4097B73A4792C/image-size/large?v=v2&amp;amp;px=999" role="button" title="Table" alt="Table" /&gt;&lt;/span&gt;I am checking whether the same datapoint is in two locations. The idea is that this datapoint should exist in BOTH locations, and be counted only once if it is flagged in both. The end result should look like this:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="Examples of Results for Cross-Checking"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1986iDAAA9F6B38DA03C2/image-size/large?v=v2&amp;amp;px=999" role="button" title="Examples of Results for Cross-Checking" alt="Examples of Results for Cross-Checking" /&gt;&lt;/span&gt;Thanks in advance&lt;/P&gt;</description>
      <pubDate>Mon, 28 Mar 2022 23:30:11 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-perform-a-cross-check-for-data-in-multiple-columns-in/m-p/24457#M17000</guid>
      <dc:creator>818674</dc:creator>
      <dc:date>2022-03-28T23:30:11Z</dc:date>
    </item>
    <item>
      <title>Re: How to perform a cross-check for data in multiple columns in same table?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-perform-a-cross-check-for-data-in-multiple-columns-in/m-p/24458#M17001</link>
      <description>&lt;P&gt;you mean like an inner join on 2 dataframes?&lt;/P&gt;</description>
      <pubDate>Tue, 29 Mar 2022 06:48:27 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-perform-a-cross-check-for-data-in-multiple-columns-in/m-p/24458#M17001</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-03-29T06:48:27Z</dc:date>
    </item>
    <item>
      <title>Re: How to perform a cross-check for data in multiple columns in same table?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-perform-a-cross-check-for-data-in-multiple-columns-in/m-p/24459#M17002</link>
      <description>&lt;P&gt;I think the example is too short to understand entirely (as in the source table, everything is distinct, and in the destination, the table count has a significant number) - please update both tables so the result will have a count based on the source.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Can datapoint have more than 2 locations?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I bet that the way is to create a copy of the dataframe and then join them together on a datapoint. Then, in the next step, filter, group, and count.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The kind of join used depends on the logic needed. For example, it can be join, inner join, but also intersect or intersectAll.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 29 Mar 2022 08:48:06 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-perform-a-cross-check-for-data-in-multiple-columns-in/m-p/24459#M17002</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2022-03-29T08:48:06Z</dc:date>
    </item>
    <item>
      <title>Re: How to perform a cross-check for data in multiple columns in same table?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-perform-a-cross-check-for-data-in-multiple-columns-in/m-p/24460#M17003</link>
      <description>&lt;P&gt;Okay, so what you're trying to do (probably) is solved by running a join using the 'Datapoints' as the index on the same dataframe. It's probably not efficient, but you'd do something like:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;df2 = df
df_joined = df.join(other=df2, on='Datapoints', how='Left').selectExpr('df.Location as `Location A`, 'df2.Location as `LocationB`')
df_joined.groupBy('Location A', 'Location B').agg(count('Location B').alias('Count'))&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&amp;nbsp;I'm not sure this syntax works exactly as is, since I've never tried joining a table to itself.&lt;/P&gt;</description>
      <pubDate>Tue, 29 Mar 2022 13:48:52 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-perform-a-cross-check-for-data-in-multiple-columns-in/m-p/24460#M17003</guid>
      <dc:creator>Mr__E</dc:creator>
      <dc:date>2022-03-29T13:48:52Z</dc:date>
    </item>
    <item>
      <title>Re: How to perform a cross-check for data in multiple columns in same table?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-perform-a-cross-check-for-data-in-multiple-columns-in/m-p/24461#M17004</link>
      <description>&lt;P&gt;It might be. That is what I have been doing, but it has not worked for me. So I'm looking for other options that may be more beneficial for cross-checking multiple locations at once. However, I am still open to the suggestion if I can make it succeed.&lt;/P&gt;</description>
      <pubDate>Wed, 30 Mar 2022 01:40:49 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-perform-a-cross-check-for-data-in-multiple-columns-in/m-p/24461#M17004</guid>
      <dc:creator>818674</dc:creator>
      <dc:date>2022-03-30T01:40:49Z</dc:date>
    </item>
    <item>
      <title>Re: How to perform a cross-check for data in multiple columns in same table?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-perform-a-cross-check-for-data-in-multiple-columns-in/m-p/24462#M17005</link>
      <description>&lt;P&gt;Here is a more accurate picture of my table:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="image"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1980iD5F44AA3075760BA/image-size/large?v=v2&amp;amp;px=999" role="button" title="image" alt="image" /&gt;&lt;/span&gt;Yes, a datapoint can have more than 2 locations. There are duplicate datapoints sharing the same location. &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Can you please clarify on what you mean by updating the count based on the source? I am trying to achieve the second table in a way that it only counts the datapoints if they go to both locations, and if there are duplicates, it only counts the datapoint ONCE rather than both times&lt;/P&gt;</description>
      <pubDate>Wed, 30 Mar 2022 01:50:14 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-perform-a-cross-check-for-data-in-multiple-columns-in/m-p/24462#M17005</guid>
      <dc:creator>818674</dc:creator>
      <dc:date>2022-03-30T01:50:14Z</dc:date>
    </item>
    <item>
      <title>Re: How to perform a cross-check for data in multiple columns in same table?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-perform-a-cross-check-for-data-in-multiple-columns-in/m-p/24463#M17006</link>
      <description>&lt;P&gt;Can you please clarify what language this is? I am trying to perform the cross-check with SQL&lt;/P&gt;</description>
      <pubDate>Wed, 30 Mar 2022 01:52:03 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-perform-a-cross-check-for-data-in-multiple-columns-in/m-p/24463#M17006</guid>
      <dc:creator>818674</dc:creator>
      <dc:date>2022-03-30T01:52:03Z</dc:date>
    </item>
    <item>
      <title>Re: How to perform a cross-check for data in multiple columns in same table?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-perform-a-cross-check-for-data-in-multiple-columns-in/m-p/24464#M17007</link>
      <description>&lt;P&gt;@Viral Barot​&amp;nbsp;, It's Python. The SQL syntax can be inferred from the above. df.join is just SQL's JOIN. selectExpr just runs an SQL SELECT expression. groupBy is just an SQL GROUP BY. alias is exquivalent to using AS. etc.&lt;/P&gt;</description>
      <pubDate>Wed, 30 Mar 2022 13:08:03 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-perform-a-cross-check-for-data-in-multiple-columns-in/m-p/24464#M17007</guid>
      <dc:creator>Mr__E</dc:creator>
      <dc:date>2022-03-30T13:08:03Z</dc:date>
    </item>
    <item>
      <title>Re: How to perform a cross-check for data in multiple columns in same table?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-perform-a-cross-check-for-data-in-multiple-columns-in/m-p/24466#M17009</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thank you very much for following up. &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I no longer need assistance with this issue.&lt;/P&gt;</description>
      <pubDate>Tue, 26 Apr 2022 21:58:02 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-perform-a-cross-check-for-data-in-multiple-columns-in/m-p/24466#M17009</guid>
      <dc:creator>818674</dc:creator>
      <dc:date>2022-04-26T21:58:02Z</dc:date>
    </item>
    <item>
      <title>Re: How to perform a cross-check for data in multiple columns in same table?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-perform-a-cross-check-for-data-in-multiple-columns-in/m-p/24467#M17010</link>
      <description>&lt;P&gt;Thank you very much for your explanation&lt;/P&gt;</description>
      <pubDate>Tue, 26 Apr 2022 21:58:27 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-perform-a-cross-check-for-data-in-multiple-columns-in/m-p/24467#M17010</guid>
      <dc:creator>818674</dc:creator>
      <dc:date>2022-04-26T21:58:27Z</dc:date>
    </item>
    <item>
      <title>Re: How to perform a cross-check for data in multiple columns in same table?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-perform-a-cross-check-for-data-in-multiple-columns-in/m-p/24469#M17012</link>
      <description>&lt;P&gt;Hi, I have marked an answer as best&lt;/P&gt;</description>
      <pubDate>Wed, 27 Apr 2022 22:47:51 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-perform-a-cross-check-for-data-in-multiple-columns-in/m-p/24469#M17012</guid>
      <dc:creator>818674</dc:creator>
      <dc:date>2022-04-27T22:47:51Z</dc:date>
    </item>
  </channel>
</rss>

