I'm currently using FAISS in a Databricks notebook to perform semantic search over text data. My current workflow looks like this:
- encode ~10k text entries using an embedding model.
- build a FAISS index in memory.
- run similarity searches using index.search().
This works fine for 10k texts, but as the dataset grows, rebuilding the FAISS index every time is becoming slow. I'd like to incrementally add new embeddings to the existing index instead of rebuilding from scratch.
Here's a simplified snippet of what I'm doing now:
import faiss
import numpy as np

# encode the query; normalize_embeddings=True already L2-normalizes the output
query_emb = embed_model.encode([query_text], normalize_embeddings=True)
query_emb = np.array(query_emb, dtype=np.float32)
faiss.normalize_L2(query_emb)  # redundant given normalize_embeddings=True, but harmless
distances, indices = index.search(query_emb, top_N)  # top_N nearest neighbors
- What is the best practice in Databricks to save/load FAISS indexes efficiently (preferably on DBFS or Delta)?
- How can I safely add new embeddings incrementally to an existing FAISS index?
- How do others handle metadata (like text or IDs) alongside the FAISS index for retrieval?
I'd appreciate guidance or example patterns for a scalable semantic search pipeline in Databricks using FAISS.
#FAISS #SemanticSearch #VectorSearch