RheaC
New Contributor II

On a dataset with millions of rows, approxSimilarityJoin(df, df, …) can become slow because it has to build a large list of candidate pairs (rows that might match) before it can score and filter them.

Candidate explosion means your settings produce too many “maybe” pairs. Even if you later keep only matches > 0.95, Spark already paid the cost to generate and move those candidates around.
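
To put rough numbers on that, here is a small plain-Python sketch (not Spark code) comparing an all-to-all comparison with comparing rows only inside their buckets. The helper names and the bucket layout are made up for illustration, and it assumes each row lands in exactly one bucket, which is a simplification of how LSH with several hash tables behaves.

```python
def all_pairs(n):
    """Number of unordered pairs among n rows (n choose 2)."""
    return n * (n - 1) // 2


def bucketed_pairs(bucket_sizes):
    """Pairs generated when rows are only compared inside their own bucket."""
    return sum(all_pairs(size) for size in bucket_sizes)


rows = 1_000_000
# Hypothetical layout: 100,000 buckets holding 10 rows each.
buckets = [10] * (rows // 10)

print("all-to-all pairs:", all_pairs(rows))
print("bucketed pairs:  ", bucketed_pairs(buckets))
```

The looser the settings, the larger the buckets get, and the candidate count drifts from the second number toward the first.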

Shuffle is the network-heavy step where Spark moves data across the cluster to group rows that land in the same LSH buckets so they can be compared.

Skew is when some buckets end up much larger than others (often caused by very common tokens like “ltd”, “street”, etc.). A few overloaded tasks then dominate total runtime.
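
One cheap way to spot likely skew before running the join is to check how many records each token appears in. The following is a minimal plain-Python sketch on a toy list; the function names and the 50% cutoff are assumptions for illustration, and on a real dataset you would compute the same counts in Spark.

```python
from collections import Counter


def token_document_frequency(names):
    """Count in how many records each lowercase token appears."""
    counts = Counter()
    for name in names:
        # set() so a token repeated inside one record is counted once
        counts.update(set(name.lower().split()))
    return counts


def hot_tokens(names, max_share):
    """Tokens that appear in more than max_share of all records."""
    counts = token_document_frequency(names)
    limit = max_share * len(names)
    return sorted(tok for tok, n in counts.items() if n > limit)


names = ["Acme Ltd", "Bolt Ltd", "Crane Ltd", "Delta Street Cafe", "Echo Ltd"]
print(hot_tokens(names, max_share=0.5))
```

Tokens flagged this way are the ones most likely to pull unrelated rows into the same oversized bucket.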

If you keep it in Spark, the usual levers are:

  1. Block first to reduce who can be compared, then run matching within each block.

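  2. Make the LSH distance threshold line up with your final threshold (don’t generate candidates far below what you’ll accept). Note that approxSimilarityJoin takes a maximum distance, not a similarity, so with MinHash (Jaccard distance) a 0.95 similarity cutoff corresponds to a distance of 0.05.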

  3. Remove junk inputs and very common tokens (short/placeholder strings, boilerplate words) to reduce collisions.
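
The three levers can be sketched together in plain Python. This is only an illustration of the idea, not the Spark LSH pipeline: it uses exact Jaccard distance on word tokens instead of MinHash, and the record fields ("id", "block", "name") and the stop-token list are hypothetical.

```python
import re
from collections import defaultdict
from itertools import combinations

# Hypothetical list of boilerplate tokens; extend it for your own data.
STOP_TOKENS = {"ltd", "street"}


def tokens(text):
    """Lowercase word tokens with boilerplate tokens removed (lever 3)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return frozenset(w for w in words if w not in STOP_TOKENS)


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; two empty sets count as maximally distant."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 1.0


def dedupe(records, min_similarity):
    # Lever 2: express the final similarity cutoff as a distance cutoff.
    max_distance = 1.0 - min_similarity

    # Lever 1: group rows by a blocking key so only rows in the
    # same block are ever compared.
    blocks = defaultdict(list)
    for rec in records:
        toks = tokens(rec["name"])
        if toks:  # lever 3: skip rows with nothing left after cleaning
            blocks[rec["block"]].append((rec["id"], toks))

    matches = []
    for members in blocks.values():
        for (id_a, toks_a), (id_b, toks_b) in combinations(members, 2):
            dist = jaccard_distance(toks_a, toks_b)
            if dist <= max_distance:
                matches.append((id_a, id_b, dist))
    return matches


records = [
    {"id": 1, "block": "NY", "name": "Acme Trading Ltd"},
    {"id": 2, "block": "NY", "name": "ACME Trading"},
    {"id": 3, "block": "CA", "name": "Acme Trading Ltd."},
    {"id": 4, "block": "NY", "name": "Ltd"},
]
print(dedupe(records, min_similarity=0.95))
```

Record 3 is never compared with records 1 and 2 because it sits in a different block, which is the trade-off of blocking: the key has to be something true duplicates reliably share.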

If you’d rather avoid maintaining and tuning the LSH pipeline, look into Similarity API. It’s a hosted “dedupe within one dataset” service: you send the list once and it returns duplicate pairs/clusters without running an all-to-all join in your Databricks job. It’s optimized for large datasets and handles the blocking, candidate generation, and scoring for you. Preprocessing is optional but usually helpful (e.g., lowercasing, punctuation removal, token sorting), especially when names/addresses have formatting noise.

Most if not all of the above could be substituted with something like this:

import os

import requests

# Send the whole list in one request; the API key is read from the environment.
resp = requests.post(
  "https://api.similarity-api.com/dedupe",
  headers={
    "Authorization": f"Bearer {os.environ['SIMILARITY_API_KEY']}",
    "Content-Type": "application/json",
  },
  json={
    "data": ["Microsoft", "Microsft", "Apple Inc.", "appLE"],  # .. plus the rest of your records
    "config": {
      "similarity_threshold": 0.70,
      "top_n": 5,
      "remove_punctuation": True,
      "to_lowercase": True,
      "use_token_sort": False,
      "output_format": "flat_table",
    },
  },
)
resp.raise_for_status()  # fail loudly on an HTTP error instead of printing an error body
print(resp.json())

Here are the full docs: https://similarity-api.com/documentation
There's also an article on how to do it in Databricks: https://similarity-api.com/blog/fuzzy-matching-in-databricks-2026

This will save you cluster time, but the service itself is paid: you need to create an account to get a token, and you pay once the trial ends.