<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Fuzzy text matching in Spark in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/fuzzy-text-matching-in-spark/m-p/29806#M21507</link>
    <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;You can use Python libraries in Spark. I suggest using fuzzywuzzy to compute the similarities.&lt;/P&gt;
&lt;P&gt;Then you just need to join the client list with the internal dataset. If you want to be sure every client name is tried against every internal name, you can do a cartesian join. But there may be a way to cut down the possibilities so you can use a more efficient join, such as assuming the internal dataset name starts with the same letter as the client name. You can even make multiple passes over the internal dataset and try more complicated logic each time.&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
    <pubDate>Fri, 01 Apr 2016 17:04:29 GMT</pubDate>
    <dc:creator>vida</dc:creator>
    <dc:date>2016-04-01T17:04:29Z</dc:date>
    <item>
      <title>Fuzzy text matching in Spark</title>
      <link>https://community.databricks.com/t5/data-engineering/fuzzy-text-matching-in-spark/m-p/29805#M21506</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;I have a client-provided list of company names.&lt;/P&gt;
&lt;P&gt;I have to match those names against an internal database of company names. The client list fits in memory (it's about 10k elements), but the internal dataset is on HDFS and we use Spark to access it.&lt;/P&gt;
&lt;P&gt;How could I go about matching the client list? I was thinking of building a matrix (RowMatrix) of N x D elements (N being the number of client elements and D being the length of the internal list) and computing the similarities pairwise.&lt;/P&gt;
&lt;P&gt;How could I do this in Spark? Any help would be more than welcome.&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 15 Mar 2016 11:09:04 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/fuzzy-text-matching-in-spark/m-p/29805#M21506</guid>
      <dc:creator>manugarri</dc:creator>
      <dc:date>2016-03-15T11:09:04Z</dc:date>
    </item>
    <item>
      <title>Re: Fuzzy text matching in Spark</title>
      <link>https://community.databricks.com/t5/data-engineering/fuzzy-text-matching-in-spark/m-p/29806#M21507</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;You can use Python libraries in Spark. I suggest using fuzzywuzzy to compute the similarities.&lt;/P&gt;
&lt;P&gt;Then you just need to join the client list with the internal dataset. If you want to be sure every client name is tried against every internal name, you can do a cartesian join. But there may be a way to cut down the possibilities so you can use a more efficient join, such as assuming the internal dataset name starts with the same letter as the client name. You can even make multiple passes over the internal dataset and try more complicated logic each time.&lt;/P&gt;
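&lt;P&gt;A minimal sketch of the first-letter blocking idea described above, in plain Python so it runs without a cluster. It is only an illustration under stated assumptions: difflib's SequenceMatcher from the standard library is swapped in for fuzzywuzzy, and the names similarity, match_with_blocking, min_score and the sample company names are hypothetical.&lt;/P&gt;

```python
from collections import defaultdict
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-100 similarity score (difflib stand-in for a fuzzywuzzy-style ratio)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def match_with_blocking(client_names, internal_names, min_score=80):
    """Compare only names that share a first letter; keep the best match per client name."""
    # Build the blocks: first letter -> internal names starting with that letter.
    blocks = defaultdict(list)
    for name in internal_names:
        if name:
            blocks[name[0].lower()].append(name)

    matches = {}
    for client in client_names:
        if not client:
            continue
        # Only this block is scanned, instead of the whole internal list.
        candidates = blocks.get(client[0].lower(), [])
        if not candidates:
            continue
        best = max(candidates, key=lambda cand: similarity(client, cand))
        if similarity(client, best) >= min_score:
            matches[client] = best
    return matches


# Hypothetical sample data.
clients = ["Acme Corp", "Globex Inc", "Initech"]
internal = ["ACME Corporation", "Globex Incorporated", "Initech", "Umbrella"]
print(match_with_blocking(clients, internal, min_score=60))
```

&lt;P&gt;In Spark, the same idea would presumably become an equi-join on a derived first-letter column followed by a similarity filter, which avoids the full cartesian product. The trade-off is that names whose first letters differ are never compared.&lt;/P&gt;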
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 01 Apr 2016 17:04:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/fuzzy-text-matching-in-spark/m-p/29806#M21507</guid>
      <dc:creator>vida</dc:creator>
      <dc:date>2016-04-01T17:04:29Z</dc:date>
    </item>
    <item>
      <title>Re: Fuzzy text matching in Spark</title>
      <link>https://community.databricks.com/t5/data-engineering/fuzzy-text-matching-in-spark/m-p/29807#M21508</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;I'm not aware of any out-of-the-box solution for this, but there have been several talks on the subject, which you can find below.&lt;/P&gt;
&lt;P&gt;&lt;A href="https://spark-summit.org/2015/events/real-time-fuzzy-matching-with-spark-and-elastic-search/" target="test_blank"&gt;https://spark-summit.org/2015/events/real-time-fuzzy-matching-with-spark-and-elastic-search/&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&lt;A href="https://spark-summit.org/2014/talk/fuzzy-matching-with-spark" target="test_blank"&gt;https://spark-summit.org/2014/talk/fuzzy-matching-with-spark&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 01 Apr 2016 17:05:51 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/fuzzy-text-matching-in-spark/m-p/29807#M21508</guid>
      <dc:creator>Bill_Chambers</dc:creator>
      <dc:date>2016-04-01T17:05:51Z</dc:date>
    </item>
    <item>
      <title>Re: Fuzzy text matching in Spark</title>
      <link>https://community.databricks.com/t5/data-engineering/fuzzy-text-matching-in-spark/m-p/29808#M21509</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Yeah, those two examples (which are the top ones that appear on Google) reference talks that basically don't explain how to implement anything.&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 01 Apr 2016 17:37:36 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/fuzzy-text-matching-in-spark/m-p/29808#M21509</guid>
      <dc:creator>manugarri</dc:creator>
      <dc:date>2016-04-01T17:37:36Z</dc:date>
    </item>
    <item>
      <title>Re: Fuzzy text matching in Spark</title>
      <link>https://community.databricks.com/t5/data-engineering/fuzzy-text-matching-in-spark/m-p/29809#M21510</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Curious if you ever found a workable solution to this. Your question is still one of the top hits when I Google it. We are facing a similar challenge, where we want to be able to fuzzy match high volume lists of individuals in HDFS / Hive. Thinking of creating something in PySpark, or implementing Elastic, but don't want to reinvent the wheel if there's something already out there. We need to standardize our data before matching as well, but that's another story.&lt;/P&gt; 
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 09 Aug 2017 21:14:27 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/fuzzy-text-matching-in-spark/m-p/29809#M21510</guid>
      <dc:creator>PaulExter</dc:creator>
      <dc:date>2017-08-09T21:14:27Z</dc:date>
    </item>
    <item>
      <title>Re: Fuzzy text matching in Spark</title>
      <link>https://community.databricks.com/t5/data-engineering/fuzzy-text-matching-in-spark/m-p/29810#M21511</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Like vida said, you can use Python libraries to get text matching algorithms.&lt;/P&gt;
&lt;P&gt;You can even register the function and use it as a UDF in SQL.&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 10 Aug 2017 07:41:49 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/fuzzy-text-matching-in-spark/m-p/29810#M21511</guid>
      <dc:creator>MatiasRotenberg</dc:creator>
      <dc:date>2017-08-10T07:41:49Z</dc:date>
    </item>
    <item>
      <title>Re: Fuzzy text matching in Spark</title>
      <link>https://community.databricks.com/t5/data-engineering/fuzzy-text-matching-in-spark/m-p/29811#M21512</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Matias, in my experience using Python UDFs is tremendously slow.&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 10 Aug 2017 08:26:36 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/fuzzy-text-matching-in-spark/m-p/29811#M21512</guid>
      <dc:creator>manugarri</dc:creator>
      <dc:date>2017-08-10T08:26:36Z</dc:date>
    </item>
    <item>
      <title>Re: Fuzzy text matching in Spark</title>
      <link>https://community.databricks.com/t5/data-engineering/fuzzy-text-matching-in-spark/m-p/29812#M21513</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;For those of you looking for a not very complicated solution, you can use the two native Spark functions soundex and levenshtein as your fuzzy matching algorithms.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;val joinedDF = accountDF.join(
  accountDF2,
  levenshtein(accountDF("name"), accountDF2("name")) &amp;lt; 3 &amp;amp;&amp;amp; (accountDF("id") =!= accountDF2("id"))
)
joinedDF.show
&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Wed, 29 Nov 2017 19:01:22 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/fuzzy-text-matching-in-spark/m-p/29812#M21513</guid>
      <dc:creator>hansonkx</dc:creator>
      <dc:date>2017-11-29T19:01:22Z</dc:date>
    </item>
    <item>
      <title>Re: Fuzzy text matching in Spark</title>
      <link>https://community.databricks.com/t5/data-engineering/fuzzy-text-matching-in-spark/m-p/29813#M21514</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;For those of you who are looking for a not too complicated solution, you can use the two built-in Spark functions soundex and &lt;I&gt;levenshtein&lt;/I&gt;.&lt;/P&gt;
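&lt;PRE&gt;&lt;CODE&gt;import org.apache.spark.sql.functions.levenshtein

// Keep pairs of rows with different ids whose names are fewer than 3 edits apart.
// =!= replaces !==, which has been deprecated since Spark 2.0.
val newDF = accountDF.join(
  accountDF2,
  levenshtein(accountDF("name"), accountDF2("name")) &amp;lt; 3 &amp;amp;&amp;amp; (accountDF("id") =!= accountDF2("id"))
)
newDF.show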
&lt;/CODE&gt;&lt;/PRE&gt; 
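&lt;P&gt;To make the join condition above concrete, here is a small pure-Python sketch of what the Levenshtein distance measures and how a "fewer than 3 edits, different id" filter behaves. This is not Spark's implementation; the helper near_duplicates, the max_distance parameter and the sample accounts are hypothetical.&lt;/P&gt;

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def near_duplicates(accounts, max_distance=3):
    """Return id pairs of different accounts whose names are fewer than max_distance edits apart."""
    pairs = []
    for id_a, name_a in accounts:
        for id_b, name_b in accounts:
            if id_a != id_b and max_distance > levenshtein(name_a, name_b):
                pairs.append((id_a, id_b))
    return pairs


# Hypothetical sample data: (id, name).
accounts = [(1, "Jon Smith"), (2, "John Smith"), (3, "Jane Doe")]
print(near_duplicates(accounts))
```

&lt;P&gt;Like the join, this nested loop compares every pair and reports each match in both directions, so it is quadratic in the number of rows. On large tables some form of blocking before the comparison is probably needed.&lt;/P&gt;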
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 29 Nov 2017 19:05:13 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/fuzzy-text-matching-in-spark/m-p/29813#M21514</guid>
      <dc:creator>hansonkx</dc:creator>
      <dc:date>2017-11-29T19:05:13Z</dc:date>
    </item>
    <item>
      <title>Re: Fuzzy text matching in Spark</title>
      <link>https://community.databricks.com/t5/data-engineering/fuzzy-text-matching-in-spark/m-p/29814#M21515</link>
      <description>&lt;P&gt;The great question about Fuzzy text matching in Spark, this is unique topic, and part of fuzzy Logic .&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 05 Jun 2019 03:47:36 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/fuzzy-text-matching-in-spark/m-p/29814#M21515</guid>
      <dc:creator>Er__Ram_Saran_B</dc:creator>
      <dc:date>2019-06-05T03:47:36Z</dc:date>
    </item>
    <item>
      <title>Re: Fuzzy text matching in Spark</title>
      <link>https://community.databricks.com/t5/data-engineering/fuzzy-text-matching-in-spark/m-p/29815#M21516</link>
      <description>&lt;P&gt;You can use Zingg: Spark based open source tool for this &lt;A href="https://github.com/zinggAI/zingg" target="test_blank"&gt;https://github.com/zinggAI/zingg&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 14 Sep 2021 07:13:20 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/fuzzy-text-matching-in-spark/m-p/29815#M21516</guid>
      <dc:creator>Sonal</dc:creator>
      <dc:date>2021-09-14T07:13:20Z</dc:date>
    </item>
    <item>
      <title>Re: Fuzzy text matching in Spark</title>
      <link>https://community.databricks.com/t5/data-engineering/fuzzy-text-matching-in-spark/m-p/135067#M50270</link>
      <description>&lt;P&gt;You can refer to this article&amp;nbsp;&lt;A href="https://medium.com/@gavaragirijarani/optimizing-large-scale-fuzzy-matching-with-apache-spark-and-databricks-3a0245165991" target="_blank"&gt;Optimizing Large-Scale Fuzzy Matching with Apache Spark and Databricks | by Gavaragirijarani | Medium&lt;/A&gt;.&lt;/P&gt;&lt;P&gt;As far as open-source libraries go, rapidfuzz is known to be faster than fuzzywuzzy.&lt;/P&gt;</description>
      <pubDate>Thu, 16 Oct 2025 06:16:55 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/fuzzy-text-matching-in-spark/m-p/135067#M50270</guid>
      <dc:creator>Edthehead</dc:creator>
      <dc:date>2025-10-16T06:16:55Z</dc:date>
    </item>
    <item>
      <title>Re: Fuzzy text matching in Spark</title>
      <link>https://community.databricks.com/t5/data-engineering/fuzzy-text-matching-in-spark/m-p/135390#M50341</link>
      <description>&lt;P&gt;+1 for rapidfuzz, I have used it in production pipelines. Better than just levenshtein function, as rapidfuzz provides a couple of other algorithms as well. I will warn you to not do what 2024 me attempted, which is use LLM to solve for this. It sounded like a good use-case until I implemented it and the $$ spend to value returned was terrible.&lt;/P&gt;</description>
      <pubDate>Mon, 20 Oct 2025 04:28:57 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/fuzzy-text-matching-in-spark/m-p/135390#M50341</guid>
      <dc:creator>Shamzaa3Q</dc:creator>
      <dc:date>2025-10-20T04:28:57Z</dc:date>
    </item>
    <item>
      <title>Re: Fuzzy text matching in Spark</title>
      <link>https://community.databricks.com/t5/data-engineering/fuzzy-text-matching-in-spark/m-p/149875#M53193</link>
      <description>&lt;P&gt;+1 on LLMs. I would check this article on using Similarity API instead of rapidfuzz in 2026 especially for larger/growing datasets&amp;nbsp;&lt;A href="https://medium.com/p/0854593e380a" target="_blank"&gt;https://medium.com/p/0854593e380a&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 05 Mar 2026 07:23:52 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/fuzzy-text-matching-in-spark/m-p/149875#M53193</guid>
      <dc:creator>RheaC</dc:creator>
      <dc:date>2026-03-05T07:23:52Z</dc:date>
    </item>
  </channel>
</rss>

