How to store & update a FAISS Index in Databricks

ashfire
New Contributor II

I’m currently using FAISS in a Databricks notebook to perform semantic search over text data. My current workflow looks like this:

  1. encode ~10k text entries using an embedding model.
  2. build a FAISS index in memory.
  3. run similarity searches using index.search().

This works fine for 10k texts, but as the dataset grows, rebuilding the FAISS index every time is becoming slow. I’d like to incrementally add new embeddings to the existing index instead of rebuilding from scratch.

Here’s a simplified snippet of what I’m doing now:

import faiss
import numpy as np

# Query-time search (index construction elided; corpus embeddings are encoded the same way)
query_emb = embed_model.encode([query_text], normalize_embeddings=True)
query_emb = np.array(query_emb, dtype=np.float32)
faiss.normalize_L2(query_emb)  # redundant when normalize_embeddings=True, but harmless
distances, indices = index.search(query_emb, top_N)
  • What is the best practice in Databricks to save/load FAISS indexes efficiently (preferably on DBFS or Delta)?
  • How can I safely add new embeddings incrementally to an existing FAISS index?
  • How do others handle metadata (like text or IDs) alongside the FAISS index for retrieval?

I’d appreciate guidance or example patterns for a scalable semantic search pipeline in Databricks using FAISS.

#FAISS #SemanticSearch #VectorSearch

 

 

1 REPLY

Louis_Frolio
Databricks Employee

Hello @ashfire ,

Here’s a practical path to scale your FAISS workflow on Databricks, along with patterns to persist indexes, incrementally add embeddings, and keep metadata aligned.

Best practice to persist/load FAISS indexes on Databricks

  • Use faiss.write_index/read_index to save/load the index as a single file on DBFS or a UC Volume. This keeps I/O simple and fast for driver-side code. If you ever use a GPU index, convert it to CPU before writing, then back to GPU after reading:

    import faiss
    
    # Save to DBFS (CPU index required for write)
    cpu_index = faiss.index_gpu_to_cpu(index) if isinstance(index, faiss.GpuIndex) else index
    faiss.write_index(cpu_index, "/dbfs/FileStore/semantic/faiss_index.faiss")
    
    # Load back
    index = faiss.read_index("/dbfs/FileStore/semantic/faiss_index.faiss")
     
  • Keep any non-index artifacts (e.g., ID mapping or stats) in a separate file (pickle/JSON) next to your index if you aren’t using FAISS’s IDMap. Example pattern: /Volumes/myVolume/semantic/faiss_index.faiss + /Volumes/myVolume/semantic/id_map.json. The I/O is efficient and simple to manage with notebook jobs on Databricks.
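
For example, a minimal sketch of saving and reloading the two artifacts together (the /Volumes path and the id_map dict are illustrative assumptions, not a fixed convention):

    import faiss
    import json

    base = "/Volumes/myVolume/semantic"  # hypothetical UC Volume path

    # Persist the index and its sidecar ID map together
    faiss.write_index(cpu_index, f"{base}/faiss_index.faiss")
    with open(f"{base}/id_map.json", "w") as f:
        json.dump(id_map, f)  # e.g., {"0": "doc-123", ...}

    # Reload both on the next run
    index = faiss.read_index(f"{base}/faiss_index.faiss")
    with open(f"{base}/id_map.json") as f:
        id_map = json.load(f)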

Safely adding embeddings incrementally to an existing FAISS index

There are two common cases:

  • For Flat indexes (e.g., IndexFlatL2/IndexFlatIP):

    • You can append vectors with index.add(new_vecs).
    • If you need persistent, stable IDs for later lookups, wrap with `IndexIDMap` and use `add_with_ids` (IDs must be int64):
      import faiss
      import numpy as np
      
      base = faiss.IndexFlatIP(dim)  # or IndexFlatL2
      index = faiss.IndexIDMap(base)
      
      new_vecs = np.asarray(new_embeddings, dtype=np.float32)
      new_ids  = np.asarray(new_int64_ids, dtype=np.int64)
      index.add_with_ids(new_vecs, new_ids)
       
    • To update existing vectors: remove the IDs you want to change and re-add the new embeddings with the same IDs:

      ids_to_update = np.asarray(ids_list, dtype=np.int64)
      index.remove_ids(ids_to_update)
      index.add_with_ids(updated_vecs, ids_to_update)
       
  • For IVF/PQ indexes (e.g., IndexIVFPQ):

    • Train once on a representative sample (you don’t need the full dataset), then you can add new embeddings incrementally without retraining the index. This is the usual pattern for large datasets where training on all vectors isn’t feasible.
    • If your data distribution drifts significantly over time, you can periodically retrain with an updated representative sample and rebuild during a maintenance window, but the common approach is "train once, add many."
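
A minimal sketch of the train-once, add-many pattern (the dimension, the nlist/m/nbits values, and the sample_embeddings variable are illustrative assumptions; note that IVF indexes support add_with_ids natively, so no IndexIDMap wrapper is needed):

    import faiss
    import numpy as np

    dim, nlist, m, nbits = 384, 1024, 8, 8  # m must divide dim
    quantizer = faiss.IndexFlatIP(dim)      # coarse quantizer for the IVF lists
    index = faiss.IndexIVFPQ(quantizer, dim, nlist, m, nbits)

    # Train once on a representative sample, not the full dataset
    index.train(np.asarray(sample_embeddings, dtype=np.float32))

    # From then on, add batches incrementally -- no retraining required
    index.add_with_ids(
        np.asarray(new_embeddings, dtype=np.float32),
        np.asarray(new_int64_ids, dtype=np.int64),
    )

    index.nprobe = 16  # search-time recall/speed trade-off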

Handling metadata alongside FAISS for retrieval

  • Store your metadata in a Delta table governed by Unity Catalog. Keep a stable primary key you also use as the FAISS ID:

    • Table schema example: id BIGINT (so it can double as the int64 FAISS ID), text STRING, embedding ARRAY<FLOAT>, plus any attributes for filtering.
    • When `index.search(...)` returns IDs (directly, or via an IDMap lookup), use them to join back to the Delta table:
      import numpy as np
      
      # Query
      q = np.asarray(query_emb, dtype=np.float32)
      D, I = index.search(q.reshape(1, -1), top_k)
      hit_ids = [int(i) for i in I[0] if i != -1]
      
      # Fetch metadata via Spark/Delta
      from pyspark.sql import functions as F
      hits_df = spark.table("main.semantic.docs").where(F.col("id").isin(hit_ids))

      # Preserve the FAISS ranking by joining on a small (id, rank) DataFrame
      rank_df = spark.createDataFrame([(i, r) for r, i in enumerate(hit_ids)], ["id", "rank"])
      display(hits_df.join(rank_df, "id").orderBy("rank"))
       
  • If you prefer to keep embeddings in Delta as well (for audits/versioning or hybrid pipelines), store them in ARRAY<FLOAT>. FAISS still expects NumPy arrays at runtime, so load batches from Delta into memory when building/updating the index.
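
A sketch of that batch-loading step (the table name and batch size are assumptions; toLocalIterator streams rows to the driver so the full table is never materialized at once):

    import numpy as np

    df = spark.table("main.semantic.embeddings").select("id", "embedding")

    BATCH = 10_000
    batch_ids, batch_vecs = [], []
    for row in df.toLocalIterator():
        batch_ids.append(row["id"])
        batch_vecs.append(row["embedding"])
        if len(batch_ids) == BATCH:
            index.add_with_ids(
                np.asarray(batch_vecs, dtype=np.float32),
                np.asarray(batch_ids, dtype=np.int64),
            )
            batch_ids, batch_vecs = [], []

    if batch_ids:  # flush the final partial batch
        index.add_with_ids(
            np.asarray(batch_vecs, dtype=np.float32),
            np.asarray(batch_ids, dtype=np.int64),
        )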

A scalable semantic search pipeline pattern on Databricks (FAISS)

  • Ingest data into a Delta table with a stable primary key.
  • Use a batch job to compute embeddings (Spark UDF, Model Serving, or a Python batch on the driver) and write results to a Delta table with id + embedding (see the UDF sketch after this list).
  • Maintain a FAISS index artifact in DBFS or a UC Volume:
    • Initial build: read embeddings in batches, train if IVF/PQ, add vectors, then write_index.
    • Incremental updates: detect new/changed rows (e.g., via a "last updated" column; see the update-job sketch after this list) and:
      • For flat indexes: add_with_ids.
      • For updates: remove_ids, then add_with_ids.
      • For IVF/PQ: ensure the index is trained; then add new embeddings. Periodically evaluate drift and re-train if needed.
  • Keep a small ID map file (if you don’t use IndexIDMap) or just rely on IndexIDMap with int64 IDs to avoid external mapping.
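
A hedged sketch of the embedding batch job as a pandas UDF (the model name, table names, and column names are all assumptions):

    import pandas as pd
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, FloatType

    @F.pandas_udf(ArrayType(FloatType()))
    def embed_udf(texts: pd.Series) -> pd.Series:
        # Import/load inside the UDF so each executor gets its own model copy
        from sentence_transformers import SentenceTransformer
        model = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical model choice
        vecs = model.encode(texts.tolist(), normalize_embeddings=True)
        return pd.Series([v.tolist() for v in vecs])

    (spark.table("main.semantic.docs")
         .withColumn("embedding", embed_udf(F.col("text")))
         .write.mode("overwrite")
         .saveAsTable("main.semantic.embeddings"))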
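
And a sketch of the incremental update job itself (the last_run_ts watermark variable, table, and index path are assumptions; remove_ids is a no-op for IDs not present, so this behaves as a safe upsert):

    import faiss
    import numpy as np
    from pyspark.sql import functions as F

    INDEX_PATH = "/Volumes/myVolume/semantic/faiss_index.faiss"  # hypothetical

    index = faiss.read_index(INDEX_PATH)

    # Rows changed since the last successful run (assumed watermark column)
    changed = (spark.table("main.semantic.embeddings")
                    .where(F.col("last_updated") > F.lit(last_run_ts))
                    .select("id", "embedding")
                    .collect())

    if changed:
        ids  = np.asarray([r["id"] for r in changed], dtype=np.int64)
        vecs = np.asarray([r["embedding"] for r in changed], dtype=np.float32)

        index.remove_ids(ids)          # drop stale versions, if any
        index.add_with_ids(vecs, ids)  # re-add with the same stable IDs

        faiss.write_index(index, INDEX_PATH)  # persist for the next run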

Consider Databricks Mosaic AI Vector Search (managed alternative)

If you’d prefer not to manage FAISS lifecycle, Databricks Mosaic AI Vector Search provides a managed index that:

  • Creates an index from a Delta table with automatic incremental sync (Delta Sync Index), or lets you write vectors directly (Direct Vector Access).
  • Stores and returns both vectors and associated metadata with Unity Catalog governance, supports hybrid keyword+vector search, and scales to very large datasets (including new storage‑optimized endpoints).
  • Lets you interact via the SDK, REST API, or SQL (`vector_search()`), while the service handles indexing, scaling, and syncing for you.

Example (Python SDK) to query an index:

%pip install databricks-vectorsearch
dbutils.library.restartPython()

from databricks.vector_search.client import VectorSearchClient
client = VectorSearchClient()
index = client.get_index(index_name="main.semantic.my_index")

results = index.similarity_search(
    query_text="my query",  # or query_vector=[...]
    num_results=5,
    columns=["id", "text", "tags", "date"]
)

Quick checklist

  • Use IndexIDMap and int64 IDs to align FAISS results with Delta metadata.
  • For IVF/PQ: train once on representative data, then add incrementally; plan occasional retraining if data distribution shifts.
  • Persist indexes with faiss.write_index/read_index; convert GPU ↔ CPU when saving/loading.
  • Keep metadata in Delta, governed by Unity Catalog, and join by IDs for retrieval.
  • For hands-off scaling and incremental sync, evaluate Mosaic AI Vector Search instead of DIY FAISS.

Hope this helps, Louis.
