Generate a group Id for similar (deduplicated) values of a DataFrame column

Adig
New Contributor III

Input DataFrame

'''
KeyName          KeyCompare       Source
PapasMrtemis     PapasMrtemis     S1
PapasMrtemis     Pappas, Mrtemis  S1
Pappas, Mrtemis  PapasMrtemis     S2
Pappas, Mrtemis  Pappas, Mrtemis  S2
Micheal          Micheal          S1
RCore            Core             S1
RCore            Core,R           S2
'''

The names come from different sources; after a union, fuzzy matching is applied to them. Now, irrespective of the source, I need a group Id for similar values.

I want to use pyspark.

The output should look like this:

'''
KeyName          KeyCompare       Source  KeyId
PapasMrtemis     PapasMrtemis     S1      1
PapasMrtemis     Pappas, Mrtemis  S1      1
Pappas, Mrtemis  PapasMrtemis     S2      1
Pappas, Mrtemis  Pappas, Mrtemis  S2      1
Micheal          Micheal          S1      2
RCore            Core             S1      3
RCore            Core,R           S2      3
'''

6 REPLIES

Unforgiven
Valued Contributor III

https://sparkbyexamples.com/pyspark/pyspark-distinct-to-drop-duplicates/

The link above may address your concern. I hope it helps in this case.

Ajay-Pandey
Esteemed Contributor III

UmaMahesh1
Honored Contributor III

Hi @Adi dev​ ,

Your requirement can be easily achieved by using a dense_rank() function.

Since your data looks a bit confusing, I created sample data of my own and assigned a group id based on KeyName. If you want to assign the group id based on other column(s), add those to the ORDER BY clause accordingly.

Input: [screenshot of the sample input DataFrame]

Output: [screenshot of the output with group ids]

Hope this helps. Cheers.

Kaniz
Community Manager

Hi @Adi dev​, we haven't heard from you since the last response from @Number Four I am​, @Ajay Pandey​, and @Uma Maheswara Rao Desula​, and I was checking back to see if their suggestions helped you.

Otherwise, if you have found a solution, please share it with the community, as it can be helpful to others.

Also, please don't forget to click the "Select As Best" button whenever the information provided helps resolve your question.

Own
Contributor

Use a hash function on the relevant columns to generate a value based on their contents. If two rows have the same values in those columns, the function generates the same hash, so you can use that hash as the group id and deduplicate on it. Note that this only groups exact matches, not fuzzy ones.

VaibB
Contributor
  1. Create a UDF that takes as input all the fields you need to consider for a unique row.
  2. Create a list by splitting the value on ' ' or ','.
  3. Sort the list.
  4. Concatenate all the elements of the list to derive a "new field".
  5. Calculate dense_rank based on the derived field.
  6. Remove the "new field".