You could do this with Spark, storing the data in Parquet/Delta. For each face you would write out a record with a column for metadata, a column for the encoded vector array, and additional columns for hashing. You could then use a Pandas UDF to do the distributed distance calculation at scale, and could probably get fast run times on a million records.
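Here's a minimal sketch of what that Pandas UDF distance pass could look like, assuming PySpark 3.x, a table with an `array<float>` column named `encoding`, and Euclidean distance. The paths, column names, and the 128-dimension query vector are placeholders, not anything from your setup:

```python
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

# Hypothetical query encoding to match against the stored vectors.
query_vec = np.random.rand(128)
query_bc = spark.sparkContext.broadcast(query_vec)

@pandas_udf(DoubleType())
def euclidean_dist(encodings: pd.Series) -> pd.Series:
    # Each element of `encodings` is one face's vector (list/array of floats).
    q = query_bc.value
    return encodings.apply(lambda v: float(np.linalg.norm(np.asarray(v) - q)))

faces = spark.read.parquet("/path/to/faces")  # or .format("delta").load(...)

# Compute the distance to the query for every record and take the closest few.
matches = (faces
           .withColumn("dist", euclidean_dist("encoding"))
           .orderBy("dist")
           .limit(10))
matches.show()
```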
I'm not sure how you would come up with the hash criteria, but if you came up with some way to bin the vector encodings, you could add a column to the Parquet/Delta table recording which bin each vector falls into and then partition the table on that (or on a combination of multiple bins). If you set it up that way, you could ensure that your Pandas UDF only looks for close matches within the partition/bin, which will speed up the match time. The downside is that you will miss edge cases where a vector was put into one partition but its closest match was actually in another. A rough illustration of one possible binning scheme is sketched below.
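Purely as an illustration of the idea (the hash scheme here is my own assumption, not a recommendation): bucket each encoding by the sign pattern of its first few components, write the table partitioned on that bucket, and compute the same bucket for the query so only one partition gets scanned by the distance UDF.

```python
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
faces = spark.read.parquet("/path/to/faces")  # hypothetical table with an array<float> "encoding" column

@pandas_udf(StringType())
def encoding_bin(encodings: pd.Series) -> pd.Series:
    # Crude sign-based bucket on the first 4 dimensions -> at most 16 bins.
    return encodings.apply(
        lambda v: "".join("1" if x >= 0 else "0" for x in np.asarray(v)[:4]))

# Write the table partitioned by bin so queries can prune partitions.
(faces
 .withColumn("bin", encoding_bin("encoding"))
 .write
 .partitionBy("bin")
 .mode("overwrite")
 .parquet("/path/to/faces_binned"))

# At query time, compute the query's bin the same way and filter on it,
# so the distance UDF only runs over that one partition.
query_vec = np.random.rand(128)  # stand-in for the real query encoding
query_bin = "".join("1" if x >= 0 else "0" for x in query_vec[:4])
candidates = (spark.read.parquet("/path/to/faces_binned")
              .where(f"bin = '{query_bin}'"))
```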
For just a million records, I'd suggest avoiding binning altogether; if you need to, reduce the length of the encoded arrays instead.
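If you do need to shrink the arrays, one option (my assumption, not part of the original suggestion) is to fit a PCA offline and store the reduced vectors; scikit-learn and a 64-component target are used here just for the sketch:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for your real encodings: 10k vectors of 128 float32 values.
encodings = np.random.rand(10_000, 128).astype(np.float32)

pca = PCA(n_components=64)
reduced = pca.fit_transform(encodings)  # 128-d -> 64-d, roughly halving storage
print(reduced.shape, pca.explained_variance_ratio_.sum())
```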