topic Re: Generate Group Id for similar deduplicate values of a dataframe column. in Data Engineering

Generate Group Id for similar deduplicate values of a dataframe column.

Adig — Wed, 23 Nov 2022 06:37:56 GMT

Inupt DataFrame

'''

KeyName KeyCompare Source

PapasMrtemis PapasMrtemis S1

PapasMrtemis Pappas, Mrtemis S1

Pappas, Mrtemis PapasMrtemis S2

Pappas, Mrtemis Pappas, Mrtemis S2

Micheal Micheal S1

RCore Core S1

RCore Core,R S2

'''

Names are coming from the different source after doing a union those applied fuzzy match on it. now irrespective of sources need a group Id for similar values.

I want to use pyspark.

Output should be like below.

'''

KeyName KeyCompare Source KeyId

PapasMrtemis PapasMrtemis S1 1

PapasMrtemis Pappas, Mrtemis S1 1

Pappas, Mrtemis PapasMrtemis S2 1

Pappas, Mrtemis Pappas, Mrtemis S2 1

Micheal Micheal S1 2

RCore Core S1 3

RCore Core,R S2 3

'''

Re: Generate Group Id for similar deduplicate values of a dataframe column.

Unforgiven — Wed, 23 Nov 2022 07:23:56 GMT

https://sparkbyexamples.com/pyspark/pyspark-distinct-to-drop-duplicates/

refer this link above may match with your concern. hope this can make and help in this case

Re: Generate Group Id for similar deduplicate values of a dataframe column.

Ajay-Pandey — Wed, 23 Nov 2022 08:36:15 GMT

Please refer https://www.geeksforgeeks.org/how-to-count-unique-id-after-groupby-in-pyspark-dataframe/ this link this might help you

Re: Generate Group Id for similar deduplicate values of a dataframe column.

UmaMahesh1 — Wed, 23 Nov 2022 13:43:48 GMT

Hi @Adi dev ,

Your requirement can be easily achieved by using a dense_rank() function.

As your data looks a bit confusing, creating a sample data on my own and assigning a group id based on KeyName. If you want to assign group id based on other column/s, you can add those to ORDER BY clause accordingly.

Input :

Output:

Hope this helps..Cheers.

Re: Generate Group Id for similar deduplicate values of a dataframe column.

Own — Tue, 29 Nov 2022 21:39:02 GMT

Use hash function on the retrieved columns to generate a unique hash value on the basis of the value in these columns. If the same values will be there in two rows then same hash will be generated by the function and then system won't allow it. Hence, you will be able to get unique for each record deduplicated.

Re: Generate Group Id for similar deduplicate values of a dataframe column.

VaibB — Fri, 02 Dec 2022 20:22:07 GMT

Create a UDF where you pass all the fields as Input that you need to take into consideration for a unique row.
Create a list by splitting based on ' ' or ','.
sort the list and
concat all the elements of the list to derive "new field".
Calculate dense_rank based on the derived field .
Remove "new field".

Re: Generate Group Id for similar deduplicate values of a dataframe column.

rafaelpoyiadzi — Thu, 22 Jan 2026 22:07:26 GMT

Hey. We’ve run into similar deduplication problems before. If the name differences are pretty minor (punctuation, spacing, small typos), fuzzy string matching can usually get you most of the way there. That kind of similarity-based clustering works fine for straightforward cases.

Once names start to vary more though (abbreviations, reordered components, nicknames, or spellings that don’t look alike character-wise), fuzzy matching starts to fall apart because it’s only comparing characters, not meaning. That’s where semantic understanding helps.

In practice, fuzzy matching missed things like “A. Butoi” vs “Alexandra Butoi”, while a semantic approach did much better overall. You can read more in the FutureSearch case study here: https://futuresearch.ai/researcher-dedupe-case-study/