cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

How to perform a cross-check for data in multiple columns in same table?

818674
New Contributor III

I am trying to check whether a certain datapoint exists in multiple locations.

This is what my table looks like:

TableI am checking whether the same datapoint is in two locations. The idea is that this datapoint should exist in BOTH locations, and be counted only once if it is flagged in both. The end result should look like this:

Examples of Results for Cross-CheckingThanks in advance

1 ACCEPTED SOLUTION

Accepted Solutions

Hubert-Dudek
Esteemed Contributor III

I think the example is too short to understand entirely (as in the source table, everything is distinct, and in the destination, the table count has a significant number) - please update both tables so the result will have a count based on the source.

Can datapoint have more than 2 locations?

I bet that the way is to create a copy of the dataframe and then join them together on a datapoint. Then, in the next step, filter, group, and count.

The kind of join used depends on the logic needed. For example, it can be join, inner join, but also intersect or intersectAll.

View solution in original post

10 REPLIES 10

-werners-
Esteemed Contributor III

you mean like an inner join on 2 dataframes?

818674
New Contributor III

It might be. That is what I have been doing, but it has not worked for me. So I'm looking for other options that may be more beneficial for cross-checking multiple locations at once. However, I am still open to the suggestion if I can make it succeed.

Hubert-Dudek
Esteemed Contributor III

I think the example is too short to understand entirely (as in the source table, everything is distinct, and in the destination, the table count has a significant number) - please update both tables so the result will have a count based on the source.

Can datapoint have more than 2 locations?

I bet that the way is to create a copy of the dataframe and then join them together on a datapoint. Then, in the next step, filter, group, and count.

The kind of join used depends on the logic needed. For example, it can be join, inner join, but also intersect or intersectAll.

818674
New Contributor III

Here is a more accurate picture of my table:

imageYes, a datapoint can have more than 2 locations. There are duplicate datapoints sharing the same location.

Can you please clarify on what you mean by updating the count based on the source? I am trying to achieve the second table in a way that it only counts the datapoints if they go to both locations, and if there are duplicates, it only counts the datapoint ONCE rather than both times

Mr__E
Contributor II

Okay, so what you're trying to do (probably) is solved by running a join using the 'Datapoints' as the index on the same dataframe. It's probably not efficient, but you'd do something like:

df2 = df
df_joined = df.join(other=df2, on='Datapoints', how='Left').selectExpr('df.Location as `Location A`, 'df2.Location as `LocationB`')
df_joined.groupBy('Location A', 'Location B').agg(count('Location B').alias('Count'))

 I'm not sure this syntax works exactly as is, since I've never tried joining a table to itself.

818674
New Contributor III

Can you please clarify what language this is? I am trying to perform the cross-check with SQL

Mr__E
Contributor II

@Viral Barot​ , It's Python. The SQL syntax can be inferred from the above. df.join is just SQL's JOIN. selectExpr just runs an SQL SELECT expression. groupBy is just an SQL GROUP BY. alias is exquivalent to using AS. etc.

818674
New Contributor III

Thank you very much for your explanation

818674
New Contributor III

Hi,

Thank you very much for following up.

I no longer need assistance with this issue.

818674
New Contributor III

Hi, I have marked an answer as best

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group