03-28-2022 04:30 PM
I am trying to check whether a certain datapoint exists in multiple locations.
This is what my table looks like:
I am checking whether the same datapoint is in two locations. The idea is that this datapoint should exist in BOTH locations, and be counted only once if it is flagged in both. The end result should look like this:
Thanks in advance
03-29-2022 01:48 AM
I think the example is too short to understand entirely (as in the source table, everything is distinct, and in the destination, the table count has a significant number) - please update both tables so the result will have a count based on the source.
Can datapoint have more than 2 locations?
I bet that the way is to create a copy of the dataframe and then join them together on a datapoint. Then, in the next step, filter, group, and count.
The kind of join used depends on the logic needed. For example, it can be join, inner join, but also intersect or intersectAll.
03-28-2022 11:48 PM
you mean like an inner join on 2 dataframes?
03-29-2022 06:40 PM
It might be. That is what I have been doing, but it has not worked for me. So I'm looking for other options that may be more beneficial for cross-checking multiple locations at once. However, I am still open to the suggestion if I can make it succeed.
03-29-2022 01:48 AM
I think the example is too short to understand entirely (as in the source table, everything is distinct, and in the destination, the table count has a significant number) - please update both tables so the result will have a count based on the source.
Can datapoint have more than 2 locations?
I bet that the way is to create a copy of the dataframe and then join them together on a datapoint. Then, in the next step, filter, group, and count.
The kind of join used depends on the logic needed. For example, it can be join, inner join, but also intersect or intersectAll.
03-29-2022 06:50 PM
Here is a more accurate picture of my table:
Yes, a datapoint can have more than 2 locations. There are duplicate datapoints sharing the same location.
Can you please clarify on what you mean by updating the count based on the source? I am trying to achieve the second table in a way that it only counts the datapoints if they go to both locations, and if there are duplicates, it only counts the datapoint ONCE rather than both times
03-29-2022 06:48 AM
Okay, so what you're trying to do (probably) is solved by running a join using the 'Datapoints' as the index on the same dataframe. It's probably not efficient, but you'd do something like:
df2 = df
df_joined = df.join(other=df2, on='Datapoints', how='Left').selectExpr('df.Location as `Location A`, 'df2.Location as `LocationB`')
df_joined.groupBy('Location A', 'Location B').agg(count('Location B').alias('Count'))
I'm not sure this syntax works exactly as is, since I've never tried joining a table to itself.
03-29-2022 06:52 PM
Can you please clarify what language this is? I am trying to perform the cross-check with SQL
03-30-2022 06:08 AM
@Viral Barot , It's Python. The SQL syntax can be inferred from the above. df.join is just SQL's JOIN. selectExpr just runs an SQL SELECT expression. groupBy is just an SQL GROUP BY. alias is exquivalent to using AS. etc.
04-26-2022 02:58 PM
Thank you very much for your explanation
04-26-2022 02:58 PM
Hi,
Thank you very much for following up.
I no longer need assistance with this issue.
04-27-2022 03:47 PM
Hi, I have marked an answer as best
Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.
If there isn’t a group near you, start one and help create a community that brings people together.
Request a New Group