Databricks Community

RheaC · 03-04-2026

+1 on LLMs. I would also check out this article on using Similarity API instead of rapidfuzz in 2026, especially for larger or growing datasets: https://medium.com/p/0854593e380a

RheaC · 03-04-2026

On a dataset with millions of rows, approxSimilarityJoin(df, df, …) can become slow because it has to build a large list of candidate pairs (rows that might match) before it can score and filter them. Candidate explosion means your settings produce to...
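
To make the candidate-pair point concrete, here is a rough sketch in plain Python. It is a simplified model, not how Spark implements the join: it assumes a single hash table in which any two rows sharing a bucket become a candidate pair, and `candidate_pairs` plus both bucket layouts are hypothetical names and data made up for illustration.

```python
from collections import Counter


def candidate_pairs(bucket_keys):
    """Count unordered candidate pairs when rows sharing a bucket key are paired.

    Hypothetical helper: a bucket holding n rows contributes n * (n - 1) / 2 pairs.
    """
    sizes = Counter(bucket_keys)
    return sum(n * (n - 1) // 2 for n in sizes.values())


# Illustrative layouts for the same 1,000 rows (made-up data, not measurements).
selective = [i // 2 for i in range(1000)]  # 500 buckets of 2 rows each
coarse = [i % 2 for i in range(1000)]      # 2 buckets of 500 rows each

print(candidate_pairs(selective))  # 500
print(candidate_pairs(coarse))     # 249500
```

The row count is identical in both cases, but the coarse bucketing yields roughly 500 times more pairs to score and filter, because the pair count grows with the square of the bucket size.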

Re: Fuzzy text matching in Spark

Re: Performance Issue with MinHash + Approx Similarity Join for Fuzzy Duplicate Detection