Fuzzy text matching in Spark

manugarri
New Contributor II

I have client-provided data: a list of company names.

I have to match those names against an internal database of company names. The client list fits in memory (it's about 10k elements), but the internal dataset is on HDFS and we use Spark to access it.

How could I go about matching the client list? I was thinking of building a matrix (RowMatrix) of N x D elements, N being the number of client elements and D being the length of the internal list of company names, and computing the similarities pairwise.
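To make the idea concrete, here is a minimal single-machine sketch in Python of the pairwise comparison I have in mind. The company names and the helper functions `similarity` and `best_match` are made up for illustration, and `difflib`'s ratio is only a stand-in for whatever similarity measure turns out to be appropriate. My assumption is that in Spark the small client list would be broadcast to the workers and something like `best_match` would be mapped over the internal dataset.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Case-insensitive similarity ratio in [0, 1] (difflib stand-in)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def best_match(internal_name, client_names):
    """Return (client_name, score) for the most similar client name."""
    scored = ((c, similarity(internal_name, c)) for c in client_names)
    return max(scored, key=lambda pair: pair[1])

# Hypothetical data: the small client list fits in memory,
# the internal list stands in for the dataset stored on HDFS.
client_names = ["Acme Corp", "Globex Inc"]
internal_names = ["ACME Corporation", "Globex Incorporated", "Initech"]

# One row of the N x D comparison per internal name, reduced to its best hit.
matches = {name: best_match(name, client_names) for name in internal_names}

for name, (client, score) in matches.items():
    print(name, "->", client, round(score, 2))
```

This compares every internal name with every client name, so the cost grows with N x D, and a score threshold would still be needed to discard weak matches such as the one for "Initech".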

How could I do this in Spark? Any help would be more than welcome.