Databricks Community

818674 · ‎03-28-2022

I am trying to check whether a certain datapoint exists in multiple locations.

This is what my table looks like:

I am checking whether the same datapoint is in two locations. The idea is that this datapoint should exist in BOTH locations, and be counted only once if it is flagged in both. The end result should look like this:

Thanks in advance

Hubert-Dudek · ‎03-29-2022

I think the example is too short to understand entirely (as in the source table, everything is distinct, and in the destination, the table count has a significant number) - please update both tables so the result will have a count based on the source.

Can datapoint have more than 2 locations?

I bet that the way is to create a copy of the dataframe and then join them together on a datapoint. Then, in the next step, filter, group, and count.

The kind of join used depends on the logic needed. For example, it can be join, inner join, but also intersect or intersectAll.

View solution in original post

-werners- · ‎03-28-2022

you mean like an inner join on 2 dataframes?

818674 · ‎03-29-2022

It might be. That is what I have been doing, but it has not worked for me. So I'm looking for other options that may be more beneficial for cross-checking multiple locations at once. However, I am still open to the suggestion if I can make it succeed.

Hubert-Dudek · ‎03-29-2022

I think the example is too short to understand entirely (as in the source table, everything is distinct, and in the destination, the table count has a significant number) - please update both tables so the result will have a count based on the source.

Can datapoint have more than 2 locations?

I bet that the way is to create a copy of the dataframe and then join them together on a datapoint. Then, in the next step, filter, group, and count.

The kind of join used depends on the logic needed. For example, it can be join, inner join, but also intersect or intersectAll.

818674 · ‎03-29-2022

Here is a more accurate picture of my table:

Yes, a datapoint can have more than 2 locations. There are duplicate datapoints sharing the same location.

Can you please clarify on what you mean by updating the count based on the source? I am trying to achieve the second table in a way that it only counts the datapoints if they go to both locations, and if there are duplicates, it only counts the datapoint ONCE rather than both times

Mr__E · ‎03-29-2022

Okay, so what you're trying to do (probably) is solved by running a join using the 'Datapoints' as the index on the same dataframe. It's probably not efficient, but you'd do something like:

df2 = df
df_joined = df.join(other=df2, on='Datapoints', how='Left').selectExpr('df.Location as `Location A`, 'df2.Location as `LocationB`')
df_joined.groupBy('Location A', 'Location B').agg(count('Location B').alias('Count'))

I'm not sure this syntax works exactly as is, since I've never tried joining a table to itself.

818674 · ‎03-29-2022

Can you please clarify what language this is? I am trying to perform the cross-check with SQL

Mr__E · ‎03-30-2022

@Viral Barot , It's Python. The SQL syntax can be inferred from the above. df.join is just SQL's JOIN. selectExpr just runs an SQL SELECT expression. groupBy is just an SQL GROUP BY. alias is exquivalent to using AS. etc.