3 weeks ago
I'm currently using FAISS in a Databricks notebook to perform semantic search over text data. My current workflow is to embed all texts, build a FAISS index over the embeddings, and then run similarity search.
This works fine for 10k texts, but as the dataset grows, rebuilding the FAISS index every time is becoming slow. I'd like to incrementally add new embeddings to the existing index instead of rebuilding from scratch.
Here's a simplified snippet of what I'm doing now:
import faiss
import numpy as np

query_emb = embed_model.encode([query_text], normalize_embeddings=True)
query_emb = np.array(query_emb, dtype=np.float32)
faiss.normalize_L2(query_emb)
distances, indices = index.search(query_emb, top_N)
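The index itself gets rebuilt from scratch on every run, roughly like this (all_texts and the Flat index type here are just illustrative):

import faiss
import numpy as np

# Re-encode the full corpus and rebuild the index on every run
embeddings = embed_model.encode(all_texts, normalize_embeddings=True)
embeddings = np.asarray(embeddings, dtype=np.float32)

index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product on normalized vectors = cosine
index.add(embeddings)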
I'd be glad to have guidance or example patterns for a scalable semantic search pipeline in Databricks using FAISS.
#FAISS #SemanticSearch #VectorSearch
3 weeks ago
Hello @ashfire,
Here's a practical path to scale your FAISS workflow on Databricks, along with patterns to persist indexes, incrementally add embeddings, and keep metadata aligned.
Use faiss.write_index/read_index to save and load the index as a single file on a UC Volume or DBFS. This keeps I/O simple and fast for driver-side code. If you ever use a GPU index, convert it to CPU before writing, then back to GPU after reading:
import faiss
# Save to DBFS (CPU index required for write)
cpu_index = faiss.index_gpu_to_cpu(index) if isinstance(index, faiss.GpuIndex) else index
faiss.write_index(cpu_index, "/dbfs/FileStore/semantic/faiss_index.faiss")
# Load back
index = faiss.read_index("/dbfs/FileStore/semantic/faiss_index.faiss")
Keep any non-index artifacts (e.g., ID mapping or stats) in a separate file (pickle/JSON) next to your index if you aren't using FAISS's IDMap. Example pattern: /Volumes/myVolume/semantic/faiss_index.faiss + /Volumes/myVolume/semantic/id_map.json. The I/O is efficient and simple to manage with notebook jobs on Databricks.
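A minimal sketch of that pattern, assuming the /Volumes paths above and a plain dict mapping FAISS IDs to document keys:

import json
import faiss

INDEX_PATH = "/Volumes/myVolume/semantic/faiss_index.faiss"  # assumed path from above
IDMAP_PATH = "/Volumes/myVolume/semantic/id_map.json"

def save_artifacts(index, id_map):
    # Persist the index and its sidecar ID mapping together so they never drift apart
    faiss.write_index(index, INDEX_PATH)
    with open(IDMAP_PATH, "w") as f:
        json.dump(id_map, f)

def load_artifacts():
    index = faiss.read_index(INDEX_PATH)
    with open(IDMAP_PATH) as f:
        # Note: JSON round-trips dict keys as strings; cast back if you keyed by int
        id_map = json.load(f)
    return index, id_map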
There are two common cases:
For Flat indexes (e.g., IndexFlatL2/IndexFlatIP):
Incremental adds are straightforward: you can simply call index.add(new_vecs). To keep IDs stable and aligned with your metadata (recommended), wrap the base index in an IndexIDMap and use add_with_ids:
import faiss
import numpy as np
base = faiss.IndexFlatIP(dim) # or IndexFlatL2
index = faiss.IndexIDMap(base)
new_vecs = np.asarray(new_embeddings, dtype=np.float32)
new_ids = np.asarray(new_int64_ids, dtype=np.int64)
index.add_with_ids(new_vecs, new_ids)
To update existing vectors: remove the IDs you want to change and re-add the new embeddings with the same IDs:
ids_to_update = np.asarray(ids_list, dtype=np.int64)
index.remove_ids(ids_to_update)
index.add_with_ids(updated_vecs, ids_to_update)
For IVF/PQ indexes (e.g., IndexIVFPQ): these must be trained before any vectors can be added. Train once on a representative sample; after that, incremental add_with_ids works just like the Flat case, though you should periodically evaluate drift and re-train if the data distribution changes. See the sketch below.
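A minimal sketch of the train-once-then-add pattern, with illustrative values for dim and nlist (an IndexIVFFlat is shown for brevity; the same flow applies to IndexIVFPQ):

import faiss
import numpy as np

dim = 384      # embedding dimension (illustrative)
nlist = 1024   # number of IVF clusters; tune to your corpus size

quantizer = faiss.IndexFlatIP(dim)
index = faiss.IndexIVFFlat(quantizer, dim, nlist, faiss.METRIC_INNER_PRODUCT)

# Train once on a representative sample -- required before any add
train_vecs = np.asarray(sample_embeddings, dtype=np.float32)
index.train(train_vecs)

# After training, incremental adds work just like the Flat case
index.add_with_ids(new_vecs, new_ids)

# At query time, nprobe trades recall for speed
index.nprobe = 16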
Store your metadata in a Delta table governed by Unity Catalog. Keep a stable primary key you also use as the FAISS ID:
id BIGINT (FAISS IDs must fit in int64), text STRING, embedding ARRAY<FLOAT>, plus any attributes for filtering.
import numpy as np
# Query
q = np.asarray(query_emb, dtype=np.float32)
D, I = index.search(q.reshape(1, -1), top_k)
hit_ids = [int(i) for i in I[0] if i != -1]
# Fetch metadata via Spark/Delta
from pyspark.sql import functions as F
hits_df = spark.table("main.semantic.docs").where(F.col("id").isin(hit_ids))
# Preserve the FAISS ranking order in the displayed results
display(hits_df.orderBy(F.array_position(F.array(*[F.lit(x) for x in hit_ids]), F.col("id"))))
If you prefer to keep embeddings in Delta as well (for audits/versioning or hybrid pipelines), store them in ARRAY<FLOAT>. FAISS still expects NumPy arrays at runtime, so load batches from Delta into memory when building/updating the index.
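A minimal sketch of that batch load, assuming the main.semantic.docs table above plus a hypothetical ingest_date watermark column used to pick up only new rows:

import faiss
import numpy as np

# Pull only the newly ingested rows (ingest_date is a hypothetical watermark column)
batch = (
    spark.table("main.semantic.docs")
    .where("ingest_date = current_date()")
    .select("id", "embedding")
    .toPandas()
)

vecs = np.asarray(batch["embedding"].tolist(), dtype=np.float32)
ids = batch["id"].to_numpy(dtype=np.int64)

faiss.normalize_L2(vecs)  # only needed for inner-product/cosine search
index.add_with_ids(vecs, ids)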
A quick recap of the pattern:
- Keep id + embedding in Delta as the source of truth.
- Persist the index with write_index.
- Add incrementally with add_with_ids.
- Update by remove_ids, then add_with_ids.
- For IVF/PQ, train once, then add new embeddings. Periodically evaluate drift and re-train if needed.
If you'd prefer not to manage the FAISS lifecycle yourself, Databricks Mosaic AI Vector Search provides a managed index that is queryable from SQL (vector_search()) and the Python SDK, and the service handles indexing, scaling, and syncing for you. Example (Python SDK) to query an index:
%pip install databricks-vectorsearch
dbutils.library.restartPython()
from databricks.vector_search.client import VectorSearchClient
client = VectorSearchClient()
index = client.get_index(index_name="main.semantic.my_index")
results = index.similarity_search(
query_text="my query", # or query_vector=[...]
num_results=5,
columns=["id", "text", "tags", "date"]
)
Hope this helps, Louis.