<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Generate Group Id for similar deduplicate values of a dataframe column. in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/generate-group-id-for-similar-deduplicate-values-of-a-dataframe/m-p/21062#M14300</link>
    <description>&lt;P&gt;Please refer to &lt;A href="https://www.geeksforgeeks.org/how-to-count-unique-id-after-groupby-in-pyspark-dataframe/" alt="https://www.geeksforgeeks.org/how-to-count-unique-id-after-groupby-in-pyspark-dataframe/" target="_blank"&gt;https://www.geeksforgeeks.org/how-to-count-unique-id-after-groupby-in-pyspark-dataframe/&lt;/A&gt;; this link might help you.&lt;/P&gt;</description>
    <pubDate>Wed, 23 Nov 2022 08:36:15 GMT</pubDate>
    <dc:creator>Ajay-Pandey</dc:creator>
    <dc:date>2022-11-23T08:36:15Z</dc:date>
    <item>
      <title>Generate Group Id for similar deduplicate values of a dataframe column.</title>
      <link>https://community.databricks.com/t5/data-engineering/generate-group-id-for-similar-deduplicate-values-of-a-dataframe/m-p/21060#M14298</link>
      <description>&lt;P&gt;Input DataFrame&lt;/P&gt;&lt;P&gt;'''&lt;/P&gt;&lt;P&gt;KeyName    KeyCompare    Source&lt;/P&gt;&lt;P&gt;PapasMrtemis    PapasMrtemis    S1&lt;/P&gt;&lt;P&gt;PapasMrtemis    Pappas, Mrtemis    S1&lt;/P&gt;&lt;P&gt;Pappas, Mrtemis    PapasMrtemis    S2&lt;/P&gt;&lt;P&gt;Pappas, Mrtemis    Pappas, Mrtemis    S2&lt;/P&gt;&lt;P&gt;Micheal    Micheal    S1&lt;/P&gt;&lt;P&gt;RCore    Core    S1&lt;/P&gt;&lt;P&gt;RCore    Core,R    S2&lt;/P&gt;&lt;P&gt;'''&lt;/P&gt;&lt;P&gt;The names come from different sources. After a union of the sources, I applied a fuzzy match to them. Now, irrespective of source, I need a group ID for similar values.&lt;/P&gt;&lt;P&gt;I want to use PySpark.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The output should look like this:&lt;/P&gt;&lt;P&gt;'''&lt;/P&gt;&lt;P&gt;KeyName    KeyCompare    Source    KeyId&lt;/P&gt;&lt;P&gt;PapasMrtemis    PapasMrtemis    S1    1&lt;/P&gt;&lt;P&gt;PapasMrtemis    Pappas, Mrtemis    S1    1&lt;/P&gt;&lt;P&gt;Pappas, Mrtemis    PapasMrtemis    S2    1&lt;/P&gt;&lt;P&gt;Pappas, Mrtemis    Pappas, Mrtemis    S2    1&lt;/P&gt;&lt;P&gt;Micheal    Micheal    S1    2&lt;/P&gt;&lt;P&gt;RCore    Core    S1    3&lt;/P&gt;&lt;P&gt;RCore    Core,R    S2    3&lt;/P&gt;&lt;P&gt;'''&lt;/P&gt;</description>
      <pubDate>Wed, 23 Nov 2022 06:37:56 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/generate-group-id-for-similar-deduplicate-values-of-a-dataframe/m-p/21060#M14298</guid>
      <dc:creator>Adig</dc:creator>
      <dc:date>2022-11-23T06:37:56Z</dc:date>
    </item>
    <item>
      <title>Re: Generate Group Id for similar deduplicate values of a dataframe column.</title>
      <link>https://community.databricks.com/t5/data-engineering/generate-group-id-for-similar-deduplicate-values-of-a-dataframe/m-p/21061#M14299</link>
      <description>&lt;P&gt;&lt;A href="https://sparkbyexamples.com/pyspark/pyspark-distinct-to-drop-duplicates/" target="test_blank"&gt;https://sparkbyexamples.com/pyspark/pyspark-distinct-to-drop-duplicates/&lt;/A&gt;&lt;/P&gt;&lt;P&gt;The link above may be relevant to your question. I hope it helps in this case.&lt;/P&gt;</description>
      <pubDate>Wed, 23 Nov 2022 07:23:56 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/generate-group-id-for-similar-deduplicate-values-of-a-dataframe/m-p/21061#M14299</guid>
      <dc:creator>Unforgiven</dc:creator>
      <dc:date>2022-11-23T07:23:56Z</dc:date>
    </item>
    <item>
      <title>Re: Generate Group Id for similar deduplicate values of a dataframe column.</title>
      <link>https://community.databricks.com/t5/data-engineering/generate-group-id-for-similar-deduplicate-values-of-a-dataframe/m-p/21062#M14300</link>
      <description>&lt;P&gt;Please refer to &lt;A href="https://www.geeksforgeeks.org/how-to-count-unique-id-after-groupby-in-pyspark-dataframe/" alt="https://www.geeksforgeeks.org/how-to-count-unique-id-after-groupby-in-pyspark-dataframe/" target="_blank"&gt;https://www.geeksforgeeks.org/how-to-count-unique-id-after-groupby-in-pyspark-dataframe/&lt;/A&gt;; this link might help you.&lt;/P&gt;</description>
      <pubDate>Wed, 23 Nov 2022 08:36:15 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/generate-group-id-for-similar-deduplicate-values-of-a-dataframe/m-p/21062#M14300</guid>
      <dc:creator>Ajay-Pandey</dc:creator>
      <dc:date>2022-11-23T08:36:15Z</dc:date>
    </item>
    <item>
      <title>Re: Generate Group Id for similar deduplicate values of a dataframe column.</title>
      <link>https://community.databricks.com/t5/data-engineering/generate-group-id-for-similar-deduplicate-values-of-a-dataframe/m-p/21063#M14301</link>
      <description>&lt;P&gt;Hi @Adi dev,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Your requirement can be met with the dense_rank() function.&lt;/P&gt;&lt;P&gt;As your data looks a bit confusing, I created my own sample data and assigned a group ID based on KeyName. If you want to assign the group ID based on other columns, add them to the ORDER BY clause accordingly.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Input:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="Input"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1133i66F918A34ACDDEF7/image-size/large?v=v2&amp;amp;px=999" role="button" title="Input" alt="Input" /&gt;&lt;/span&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Output:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="Output"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1132iA73677AED1E352B2/image-size/large?v=v2&amp;amp;px=999" role="button" title="Output" alt="Output" /&gt;&lt;/span&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Hope this helps. Cheers.&lt;/P&gt;</description>
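A minimal sketch of what a dense rank does, written in plain Python so it runs without a Spark session. It is not the code behind the screenshots in the reply; the function name and sample values are hypothetical. In PySpark the same idea would presumably be expressed with dense_rank() over a window ordered by KeyName.

```python
def dense_rank_ids(keys):
    """Map each key to a 1-based id: equal keys share an id and ids have no gaps."""
    # Rank the distinct keys in sorted order, then look each row's key up.
    rank_of = {key: rank for rank, key in enumerate(sorted(set(keys)), start=1)}
    return [rank_of[key] for key in keys]


# Hypothetical sample values, not the data from the screenshots.
names = ["Micheal", "RCore", "PapasMrtemis", "RCore", "Micheal"]
ids = dense_rank_ids(names)
print(list(zip(names, ids)))
```

Note that a dense rank only groups values that are exactly equal, so near-duplicates such as differently punctuated names still need to be normalized or matched first.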
      <pubDate>Wed, 23 Nov 2022 13:43:48 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/generate-group-id-for-similar-deduplicate-values-of-a-dataframe/m-p/21063#M14301</guid>
      <dc:creator>UmaMahesh1</dc:creator>
      <dc:date>2022-11-23T13:43:48Z</dc:date>
    </item>
    <item>
      <title>Re: Generate Group Id for similar deduplicate values of a dataframe column.</title>
      <link>https://community.databricks.com/t5/data-engineering/generate-group-id-for-similar-deduplicate-values-of-a-dataframe/m-p/21065#M14303</link>
      <description>&lt;P&gt;Use a hash function on the retrieved columns to generate a hash value based on the values in those columns. If two rows contain the same values, the function generates the same hash for both, and the system then won't allow the duplicate. Hence, you get a unique value for each deduplicated record.&lt;/P&gt;</description>
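A rough sketch of the hashing idea in plain Python, assuming SHA-256 as the hash and a separator character between column values; both choices, and the function name, are assumptions rather than anything stated in the reply. In PySpark one would likely reach for built-ins such as sha2 together with concat_ws instead.

```python
import hashlib


def group_hash(*values):
    """Return a deterministic hex digest for a tuple of column values."""
    # The unit separator keeps ("ab", "c") from colliding with ("a", "bc").
    joined = "\x1f".join(values)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Hypothetical rows: identical values give identical hashes.
a = group_hash("RCore", "S1")
b = group_hash("RCore", "S1")
c = group_hash("RCore", "S2")
print(a == b, a == c)
```

A hash like this only matches rows whose values are exactly equal, so it identifies exact duplicates; similar but differently spelled names would hash to different values unless they are normalized beforehand.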
      <pubDate>Tue, 29 Nov 2022 21:39:02 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/generate-group-id-for-similar-deduplicate-values-of-a-dataframe/m-p/21065#M14303</guid>
      <dc:creator>Own</dc:creator>
      <dc:date>2022-11-29T21:39:02Z</dc:date>
    </item>
    <item>
      <title>Re: Generate Group Id for similar deduplicate values of a dataframe column.</title>
      <link>https://community.databricks.com/t5/data-engineering/generate-group-id-for-similar-deduplicate-values-of-a-dataframe/m-p/21066#M14304</link>
      <description>&lt;OL&gt;&lt;LI&gt;Create a UDF that takes as input all the fields you need to consider for a unique row.&lt;/LI&gt;&lt;LI&gt;Create a list by splitting on ' ' or ','.&lt;/LI&gt;&lt;LI&gt;Sort the list.&lt;/LI&gt;&lt;LI&gt;Concatenate all the elements of the list to derive a "new field".&lt;/LI&gt;&lt;LI&gt;Calculate dense_rank based on the derived field.&lt;/LI&gt;&lt;LI&gt;Remove the "new field".&lt;/LI&gt;&lt;/OL&gt;</description>
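The steps above can be sketched in plain Python as follows. This is an illustration under assumptions, not the poster's code: the function names and sample names are hypothetical, and in PySpark the normalization would live in a UDF with dense_rank applied over the derived column.

```python
import re


def normalized_key(name):
    """Split on spaces or commas, sort the tokens, and join them back together."""
    tokens = [tok for tok in re.split(r"[ ,]+", name) if tok]
    return "".join(sorted(tokens))


def group_ids(names):
    """Dense-rank the normalized keys so names with the same tokens share an id."""
    keys = [normalized_key(n) for n in names]
    rank_of = {k: r for r, k in enumerate(sorted(set(keys)), start=1)}
    return [rank_of[k] for k in keys]


# Hypothetical variants with reordered or differently separated tokens.
names = ["Core,R", "R Core", "Micheal", "Pappas, Mrtemis", "Mrtemis Pappas"]
ids = group_ids(names)
print(list(zip(names, ids)))
```

This groups names whose tokens are the same after reordering, but it does not bridge spelling differences: "PapasMrtemis" and "Pappas, Mrtemis" produce different keys, so those pairs would still need the fuzzy-match result to be grouped together.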
      <pubDate>Fri, 02 Dec 2022 20:22:07 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/generate-group-id-for-similar-deduplicate-values-of-a-dataframe/m-p/21066#M14304</guid>
      <dc:creator>VaibB</dc:creator>
      <dc:date>2022-12-02T20:22:07Z</dc:date>
    </item>
    <item>
      <title>Re: Generate Group Id for similar deduplicate values of a dataframe column.</title>
      <link>https://community.databricks.com/t5/data-engineering/generate-group-id-for-similar-deduplicate-values-of-a-dataframe/m-p/144943#M52420</link>
      <description>&lt;DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;Hey. We’ve run into similar deduplication problems before.&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;If the name differences are pretty minor (punctuation, spacing, small typos), fuzzy string matching can usually get you most of the way there. That kind of similarity-based clustering works fine for straightforward cases.&lt;/SPAN&gt;&lt;/DIV&gt;&lt;BR /&gt;&lt;DIV&gt;&lt;SPAN&gt;Once names start to vary more though (abbreviations, reordered components, nicknames, or spellings that don’t look alike character-wise), fuzzy matching starts to fall apart because it’s only comparing characters, not meaning. That’s where semantic understanding helps.&lt;/SPAN&gt;&lt;/DIV&gt;&lt;BR /&gt;&lt;DIV&gt;&lt;SPAN&gt;In practice, fuzzy matching missed things like “A. Butoi” vs “Alexandra Butoi”, while a semantic approach did much better overall. You can read more in the FutureSearch case study here: &lt;A href="https://futuresearch.ai/researcher-dedupe-case-study/" target="_blank"&gt;https://futuresearch.ai/researcher-dedupe-case-study/&lt;/A&gt;&lt;/SPAN&gt;&lt;/DIV&gt;&lt;/DIV&gt;</description>
      <pubDate>Thu, 22 Jan 2026 22:07:26 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/generate-group-id-for-similar-deduplicate-values-of-a-dataframe/m-p/144943#M52420</guid>
      <dc:creator>rafaelpoyiadzi</dc:creator>
      <dc:date>2026-01-22T22:07:26Z</dc:date>
    </item>
  </channel>
</rss>

