Data Engineering
Best Database for facial recognition/ Fast comparisons of Euclidean distance

davidmory38
New Contributor

Hello people,

I'm trying to build a facial recognition application. I have a working API that takes in an image of a face and spits out a vector that encodes it. I need to run this on a million faces, store the vectors in a database, and when the system goes online it should take a face in, get its vector, and compute the distance to all the stored vectors to find the closest one.

I'm hearing of locality-sensitive hashing, and that makes sense, but what else can I do at the level of database selection and design to make these lookups quicker? TIA

1 REPLY

Dan_Z
Honored Contributor

You could do this with Spark, storing the data in Parquet/Delta. For each face you would write out a record with a column for metadata, a column for the encoded vector array, and other columns for hashing. You could use a pandas UDF to do the distributed distance calculation at scale, and could probably get fast run times on a million records.
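A minimal sketch of that distance calculation, in plain NumPy/pandas so it runs standalone: the function body below is what you would wrap in a Spark `pandas_udf`, and the 128-dimensional vectors, query, and row count are made-up stand-ins, not anything from the original post.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy stand-in for the Delta table's vector column: one encoded face per row.
vectors = pd.Series([rng.random(128) for _ in range(1_000)])
# Stand-in for the vector the face-encoding API returns for the probe image.
query = rng.random(128)

def euclidean_dist(vecs: pd.Series) -> pd.Series:
    """Vectorized Euclidean distance to the query.

    This is the body you would decorate with @pandas_udf(DoubleType())
    so Spark applies it to each batch of rows in parallel.
    """
    mat = np.stack(vecs.to_numpy())               # shape (n_rows, 128)
    return pd.Series(np.linalg.norm(mat - query, axis=1))

dists = euclidean_dist(vectors)
best = int(dists.idxmin())  # row index of the closest stored face
```

In Spark you would add the distances as a column (`df.withColumn("dist", euclidean_dist_udf("vector"))`) and take the row with the minimum, rather than collecting everything to the driver.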

I'm not sure how you would come up with the hash criteria, but if you came up with some way to bin the vector encodings, you could add a column to the Parquet/Delta table recording which bin each vector falls into, and then partition the table on that column (or on some combination of multiple bins). If you set it up that way, your pandas UDF only needs to look for close matches within the partition/bin, which will speed up the match time. The downside is that you will miss edge cases where a vector got put into one partition but its closest match was actually in another.
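One hypothetical way to produce such bins is random-projection LSH: hash each vector to the sign pattern of a few random hyperplane projections, so nearby vectors tend to land in the same bin. The number of planes, dimensions, and column names below are illustrative assumptions, not something from the post.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical binning scheme: k random hyperplanes give 2**k possible bins.
k = 4
planes = rng.standard_normal((k, 128))

def lsh_bin(vec: np.ndarray) -> int:
    """Map a 128-dim vector to a bin id in [0, 2**k)."""
    bits = (planes @ vec > 0).astype(int)         # sign pattern of projections
    return int("".join(map(str, bits)), 2)

vec = rng.random(128)
b = lsh_bin(vec)  # this value would become the table's partition column
```

In the Spark setup Dan_Z describes, you would wrap `lsh_bin` in a UDF, write it out as a `bin` column, and `partitionBy("bin")` when writing the Delta table, so the matcher only scans the probe vector's partition.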

For just a million records, I'd suggest avoiding binning altogether and, if you need to, encoding your arrays so as to reduce their size.
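As a sketch of one cheap encoding trick (an assumption on my part, not spelled out in the reply): downcasting the vectors to `float32` halves storage while leaving Euclidean distances essentially unchanged for nearest-neighbour ranking.

```python
import numpy as np

rng = np.random.default_rng(2)
vecs = rng.random((10_000, 128))   # sample of encoded faces
query = rng.random(128)

# Half the bytes per vector; for unit-scale encodings the distance error
# introduced by float32 is orders of magnitude below typical neighbour gaps.
vecs32 = vecs.astype(np.float32)

d64 = np.linalg.norm(vecs - query, axis=1)
d32 = np.linalg.norm(vecs32 - query.astype(np.float32), axis=1)
max_err = float(np.abs(d64 - d32).max())
```

If that still isn't small enough, a learned projection (e.g. PCA to fewer dimensions) is the usual next step, at the cost of some recall.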