How to store & update a FAISS Index in Databricks

ashfire
New Contributor II

I’m currently using FAISS in a Databricks notebook to perform semantic search over text data. My current workflow looks like this:

  1. encode ~10k text entries using an embedding model.
  2. build a FAISS index in memory.
  3. run similarity searches using index.search().

This works fine for 10k texts, but as the dataset grows, rebuilding the FAISS index every time is becoming slow. I’d like to incrementally add new embeddings to the existing index instead of rebuilding from scratch.

Here’s a simplified snippet of what I’m doing now:

import faiss
import numpy as np

# Query-time search (index construction elided; corpus embeddings are encoded the same way)
query_emb = embed_model.encode([query_text], normalize_embeddings=True)
query_emb = np.array(query_emb, dtype=np.float32)
faiss.normalize_L2(query_emb)  # redundant when normalize_embeddings=True, but harmless
distances, indices = index.search(query_emb, top_N)
  • What is the best practice in Databricks to save/load FAISS indexes efficiently (preferably on DBFS or Delta)?
  • How can I safely add new embeddings incrementally to an existing FAISS index?
  • How do others handle metadata (like text or IDs) alongside the FAISS index for retrieval?

I’d appreciate guidance or example patterns for a scalable semantic search pipeline in Databricks using FAISS.

#FAISS #SemanticSearch #VectorSearch

 

 

1 REPLY

Louis_Frolio
Databricks Employee

Hello @ashfire ,

Here’s a practical path to scale your FAISS workflow on Databricks, along with patterns to persist indexes, incrementally add embeddings, and keep metadata aligned.

Best practice to persist/load FAISS indexes on Databricks

  • Use faiss.write_index/read_index to save/load the index as a single file on DBFS or a UC Volume. This keeps I/O simple and fast for driver-side code. If you ever use a GPU index, convert it to CPU before writing, then back to GPU after reading:

    import faiss
    
    # Save to DBFS (CPU index required for write)
    cpu_index = faiss.index_gpu_to_cpu(index) if isinstance(index, faiss.GpuIndex) else index
    faiss.write_index(cpu_index, "/dbfs/FileStore/semantic/faiss_index.faiss")
    
    # Load back
    index = faiss.read_index("/dbfs/FileStore/semantic/faiss_index.faiss")
     
  • Keep any non-index artifacts (e.g., ID mapping or stats) in a separate file (pickle/JSON) next to your index if you aren’t using FAISS’s IDMap. Example pattern: /Volumes/myVolume/semantic/faiss_index.faiss + /Volumes/myVolume/semantic/id_map.json. The I/O is efficient and simple to manage with notebook jobs on Databricks.
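
For example, a minimal sketch of saving and reloading the two artifacts together (the /Volumes path and the id_map dict are illustrative assumptions, not a fixed convention):

    import faiss
    import json

    base = "/Volumes/myVolume/semantic"  # hypothetical UC Volume path

    # Persist the index and its sidecar ID map together
    faiss.write_index(cpu_index, f"{base}/faiss_index.faiss")
    with open(f"{base}/id_map.json", "w") as f:
        json.dump(id_map, f)  # e.g., {"0": "doc-123", ...}

    # Reload both on the next run
    index = faiss.read_index(f"{base}/faiss_index.faiss")
    with open(f"{base}/id_map.json") as f:
        id_map = json.load(f)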

Safely adding embeddings incrementally to an existing FAISS index

There are two common cases:

  • For Flat indexes (e.g., IndexFlatL2/IndexFlatIP):

    • You can append vectors with index.add(new_vecs).
    • If you need persistent, stable IDs for later lookups, wrap with `IndexIDMap` and use `add_with_ids` (IDs must be int64):
      import faiss
      import numpy as np
      
      base = faiss.IndexFlatIP(dim)  # or IndexFlatL2
      index = faiss.IndexIDMap(base)
      
      new_vecs = np.asarray(new_embeddings, dtype=np.float32)
      new_ids  = np.asarray(new_int64_ids, dtype=np.int64)
      index.add_with_ids(new_vecs, new_ids)
       
    • To update existing vectors: remove the IDs you want to change and re-add the new embeddings with the same IDs:

      ids_to_update = np.asarray(ids_list, dtype=np.int64)
      index.remove_ids(ids_to_update)
      index.add_with_ids(updated_vecs, ids_to_update)
       
  • For IVF/PQ indexes (e.g., IndexIVFPQ):

    • Train once on a representative sample (you don’t need the full dataset), then you can add new embeddings incrementally without retraining the index. This is the usual pattern for large datasets where training on all vectors isn’t feasible.
    • If your data distribution drifts significantly over time, you can periodically retrain with an updated representative sample and rebuild during a maintenance window, but the common approach is "train once, add many."
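
A minimal sketch of the train-once, add-many pattern (the dimension, the nlist/m/nbits values, and the sample_embeddings variable are illustrative assumptions; note that IVF indexes support add_with_ids natively, so no IndexIDMap wrapper is needed):

    import faiss
    import numpy as np

    dim, nlist, m, nbits = 384, 1024, 8, 8  # m must divide dim
    quantizer = faiss.IndexFlatIP(dim)      # coarse quantizer for the IVF lists
    index = faiss.IndexIVFPQ(quantizer, dim, nlist, m, nbits)

    # Train once on a representative sample, not the full dataset
    index.train(np.asarray(sample_embeddings, dtype=np.float32))

    # From then on, add batches incrementally -- no retraining required
    index.add_with_ids(
        np.asarray(new_embeddings, dtype=np.float32),
        np.asarray(new_int64_ids, dtype=np.int64),
    )

    index.nprobe = 16  # search-time recall/speed trade-off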

Handling metadata alongside FAISS for retrieval

  • Store your metadata in a Delta table governed by Unity Catalog. Keep a stable primary key you also use as the FAISS ID:

    • Table schema example: id BIGINT (so it can double as the int64 FAISS ID), text STRING, embedding ARRAY<FLOAT>, plus any attributes for filtering.
    • When `index.search(...)` returns IDs (directly, or via an IDMap lookup), use them to join back to the Delta table:
      import numpy as np
      
      # Query
      q = np.asarray(query_emb, dtype=np.float32)
      D, I = index.search(q.reshape(1, -1), top_k)
      hit_ids = [int(i) for i in I[0] if i != -1]
      
      # Fetch metadata via Spark/Delta
      from pyspark.sql import functions as F
      hits_df = spark.table("main.semantic.docs").where(F.col("id").isin(hit_ids))

      # Preserve the FAISS ranking by joining on a small (id, rank) DataFrame
      rank_df = spark.createDataFrame([(i, r) for r, i in enumerate(hit_ids)], ["id", "rank"])
      display(hits_df.join(rank_df, "id").orderBy("rank"))
       
  • If you prefer to keep embeddings in Delta as well (for audits/versioning or hybrid pipelines), store them in ARRAY<FLOAT>. FAISS still expects NumPy arrays at runtime, so load batches from Delta into memory when building/updating the index.
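
A sketch of that batch-loading step (the table name and batch size are assumptions; toLocalIterator streams rows to the driver so the full table is never materialized at once):

    import numpy as np

    df = spark.table("main.semantic.embeddings").select("id", "embedding")

    BATCH = 10_000
    batch_ids, batch_vecs = [], []
    for row in df.toLocalIterator():
        batch_ids.append(row["id"])
        batch_vecs.append(row["embedding"])
        if len(batch_ids) == BATCH:
            index.add_with_ids(
                np.asarray(batch_vecs, dtype=np.float32),
                np.asarray(batch_ids, dtype=np.int64),
            )
            batch_ids, batch_vecs = [], []

    if batch_ids:  # flush the final partial batch
        index.add_with_ids(
            np.asarray(batch_vecs, dtype=np.float32),
            np.asarray(batch_ids, dtype=np.int64),
        )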

A scalable semantic search pipeline pattern on Databricks (FAISS)

  • Ingest data into a Delta table with a stable primary key.
  • Use a batch job to compute embeddings (Spark UDF, Model Serving, or a Python batch on the driver) and write results to a Delta table with id + embedding (see the UDF sketch after this list).
  • Maintain a FAISS index artifact in DBFS or a UC Volume:
    • Initial build: read embeddings in batches, train if IVF/PQ, add vectors, then write_index.
    • Incremental updates: detect new/changed rows (e.g., via a "last updated" column; see the update-job sketch after this list) and:
      • For flat indexes: add_with_ids.
      • For updates: remove_ids, then add_with_ids.
      • For IVF/PQ: ensure the index is trained; then add new embeddings. Periodically evaluate drift and re-train if needed.
  • Keep a small ID map file (if you don’t use IndexIDMap) or just rely on IndexIDMap with int64 IDs to avoid external mapping.
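
A hedged sketch of the embedding batch job as a pandas UDF (the model name, table names, and column names are all assumptions):

    import pandas as pd
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, FloatType

    @F.pandas_udf(ArrayType(FloatType()))
    def embed_udf(texts: pd.Series) -> pd.Series:
        # Import/load inside the UDF so each executor gets its own model copy
        from sentence_transformers import SentenceTransformer
        model = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical model choice
        vecs = model.encode(texts.tolist(), normalize_embeddings=True)
        return pd.Series([v.tolist() for v in vecs])

    (spark.table("main.semantic.docs")
         .withColumn("embedding", embed_udf(F.col("text")))
         .write.mode("overwrite")
         .saveAsTable("main.semantic.embeddings"))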
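
And a sketch of the incremental update job itself (the last_run_ts watermark variable, table, and index path are assumptions; remove_ids is a no-op for IDs not present, so this behaves as a safe upsert):

    import faiss
    import numpy as np
    from pyspark.sql import functions as F

    INDEX_PATH = "/Volumes/myVolume/semantic/faiss_index.faiss"  # hypothetical

    index = faiss.read_index(INDEX_PATH)

    # Rows changed since the last successful run (assumed watermark column)
    changed = (spark.table("main.semantic.embeddings")
                    .where(F.col("last_updated") > F.lit(last_run_ts))
                    .select("id", "embedding")
                    .collect())

    if changed:
        ids  = np.asarray([r["id"] for r in changed], dtype=np.int64)
        vecs = np.asarray([r["embedding"] for r in changed], dtype=np.float32)

        index.remove_ids(ids)          # drop stale versions, if any
        index.add_with_ids(vecs, ids)  # re-add with the same stable IDs

        faiss.write_index(index, INDEX_PATH)  # persist for the next run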

Consider Databricks Mosaic AI Vector Search (managed alternative)

If you’d prefer not to manage FAISS lifecycle, Databricks Mosaic AI Vector Search provides a managed index that:

  • Creates an index from a Delta table with automatic incremental sync (Delta Sync Index), or lets you write vectors directly (Direct Vector Access).
  • Stores and returns both vectors and associated metadata with Unity Catalog governance, supports hybrid keyword+vector search, and scales to very large datasets (including new storage‑optimized endpoints).
  • Lets you interact via the SDK, REST API, or SQL (`vector_search()`), while the service handles indexing, scaling, and syncing for you.

Example (Python SDK) to query an index:

%pip install databricks-vectorsearch
dbutils.library.restartPython()

from databricks.vector_search.client import VectorSearchClient
client = VectorSearchClient()
index = client.get_index(index_name="main.semantic.my_index")

results = index.similarity_search(
    query_text="my query",  # or query_vector=[...]
    num_results=5,
    columns=["id", "text", "tags", "date"]
)

Quick checklist

  • Use IndexIDMap and int64 IDs to align FAISS results with Delta metadata.
  • For IVF/PQ: train once on representative data, then add incrementally; plan occasional retraining if data distribution shifts.
  • Persist indexes with faiss.write_index/read_index; convert GPU ↔ CPU when saving/loading.
  • Keep metadata in Delta, governed by Unity Catalog, and join by IDs for retrieval.
  • For hands-off scaling and incremental sync, evaluate Mosaic AI Vector Search instead of DIY FAISS.

Hope this helps, Louis.
