
How to perform a cross-check for data in multiple columns in same table?

818674
New Contributor III

I am trying to check whether a certain datapoint exists in multiple locations.

This is what my table looks like:

[Image: Table]

I am checking whether the same datapoint is in two locations. The idea is that this datapoint should exist in BOTH locations, and be counted only once if it is flagged in both. The end result should look like this:

[Image: Examples of Results for Cross-Checking]

Thanks in advance


13 REPLIES

-werners-
Esteemed Contributor III

you mean like an inner join on 2 dataframes?

818674
New Contributor III

It might be. That is what I have been doing, but it has not worked for me. So I'm looking for other options that may be more beneficial for cross-checking multiple locations at once. However, I am still open to the suggestion if I can make it succeed.

Hubert-Dudek
Esteemed Contributor III

I think the example is too short to understand it fully (in the source table everything is distinct, yet in the result table the count reaches a significant number) - please update both tables so the resulting count can be traced back to the source.

Can a datapoint have more than 2 locations?

I bet the way to do it is to create a copy of the dataframe and then join the two together on the datapoint. Then, in the next step, filter, group, and count.

The kind of join to use depends on the logic needed. For example, it could be a plain join or an inner join, but also intersect or intersectAll.
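
A minimal PySpark sketch of that idea (the column names Datapoint and Location and the sample rows are just illustrative assumptions, not your actual schema):

from pyspark.sql import SparkSession
from pyspark.sql.functions import countDistinct

spark = SparkSession.builder.getOrCreate()

# Illustrative data: each row records that a datapoint was flagged in a location
df = spark.createDataFrame(
    [("dp1", "A"), ("dp1", "B"), ("dp2", "A"), ("dp1", "B")],
    ["Datapoint", "Location"],
)

# Copy (alias) the dataframe and join it to itself on the datapoint,
# keeping each pair of distinct locations once (drops same-location and mirrored pairs)
pairs = (
    df.alias("a")
    .join(df.alias("b"), on="Datapoint", how="inner")
    .filter("a.Location < b.Location")
    .selectExpr("Datapoint", "a.Location AS `Location A`", "b.Location AS `Location B`")
)

# Group by the location pair and count each datapoint only once
pairs.groupBy("Location A", "Location B") \
    .agg(countDistinct("Datapoint").alias("Count")) \
    .show()

Swapping the inner join for intersect or intersectAll would instead compare whole rows across the two dataframes (intersect deduplicates, intersectAll keeps duplicates), so the right choice depends on the exact logic you need.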

818674
New Contributor III

Here is a more accurate picture of my table:

[Image]

Yes, a datapoint can have more than 2 locations. There are duplicate datapoints sharing the same location.

Can you please clarify what you mean by updating the count based on the source? I am trying to achieve the second table in a way that only counts the datapoints if they go to both locations, and, if there are duplicates, counts the datapoint ONCE rather than both times.

Mr__E
Contributor II

Okay, so what you're trying to do (probably) is solved by running a join using the 'Datapoints' as the index on the same dataframe. It's probably not efficient, but you'd do something like:

from pyspark.sql.functions import count
# Alias each side of the self-join so the two Location columns can be told apart
df2 = df.alias('df2')
df_joined = df.alias('df1').join(other=df2, on='Datapoints', how='left') \
    .selectExpr('df1.Location as `Location A`', 'df2.Location as `Location B`')
df_joined.groupBy('Location A', 'Location B').agg(count('Location B').alias('Count'))

 I'm not sure this syntax works exactly as is, since I've never tried joining a table to itself.

818674
New Contributor III

Can you please clarify what language this is? I am trying to perform the cross-check with SQL

Mr__E
Contributor II

@Viral Barot​, It's Python. The SQL syntax can be inferred from the above: df.join is just SQL's JOIN, selectExpr just runs a SQL SELECT expression, groupBy is just SQL's GROUP BY, and alias is equivalent to using AS.
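
As a rough sketch of that mapping (the table name my_table and the column names Datapoints and Location are assumptions, not your actual schema), the same self-join can be written in Spark SQL and run from a notebook:

# spark is the active SparkSession (predefined in Databricks notebooks);
# table and column names below are illustrative placeholders
result = spark.sql("""
    SELECT a.Location AS `Location A`,
           b.Location AS `Location B`,
           COUNT(DISTINCT a.Datapoints) AS `Count`
    FROM my_table AS a
    JOIN my_table AS b
      ON a.Datapoints = b.Datapoints
    WHERE a.Location < b.Location
    GROUP BY a.Location, b.Location
""")
result.show()

The WHERE clause keeps each pair of distinct locations only once, and COUNT(DISTINCT ...) ensures a datapoint flagged in both locations is counted a single time.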

818674
New Contributor III

Thank you very much for your explanation

Kaniz
Community Manager

Hi @Viral Barot​ , Just a friendly follow-up. Do you still need help? Please let us know.

818674
New Contributor III

Hi,

Thank you very much for following up.

I no longer need assistance with this issue.

Kaniz
Community Manager

Hi @Viral Barot​ , Thank you for the quick update. Would you mind marking an answer as the best?

818674
New Contributor III

Hi, I have marked an answer as best

Kaniz
Community Manager

Thank you @Viral Barot​ . This way, you're helping the entire community on a larger scale.
