<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How to store &amp; update a FAISS Index in Databricks in Machine Learning</title>
    <link>https://community.databricks.com/t5/machine-learning/how-to-store-amp-update-a-faiss-index-in-databricks/m-p/138938#M4440</link>
    <description>&lt;P&gt;Hello&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/187797"&gt;@ashfire&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;
&lt;P class="qt3gz91 paragraph"&gt;Here’s a practical path to scale your FAISS workflow on Databricks, along with patterns to persist indexes, incrementally add embeddings, and keep metadata aligned.&lt;/P&gt;
&lt;H3 class="_7uu25p0 qt3gz9c _7pq7t612 heading3 _7uu25p1"&gt;Best practice to persist/load FAISS indexes on Databricks&lt;/H3&gt;
&lt;UL class="qt3gz97 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;Use &lt;STRONG&gt;faiss.write_index/read_index&lt;/STRONG&gt; to save/load the index as a single file on a UC Volume. This keeps I/O simple and fast for driver-side code. If you ever use a GPU index, convert it to CPU before writing, then back to GPU after reading:&lt;/P&gt;
&lt;DIV class="go8b9g1 _7pq7t6cj" data-ui-element="code-block-container"&gt;
&lt;PRE&gt;&lt;CODE class="markdown-code-python qt3gz9e hljs language-python _1ymogdh2"&gt;&lt;SPAN class="hljs-keyword"&gt;import&lt;/SPAN&gt; faiss

&lt;SPAN class="hljs-comment"&gt;# Save to DBFS (CPU index required for write)&lt;/SPAN&gt;
cpu_index = faiss.index_gpu_to_cpu(index) &lt;SPAN class="hljs-keyword"&gt;if&lt;/SPAN&gt; &lt;SPAN class="hljs-built_in"&gt;isinstance&lt;/SPAN&gt;(index, faiss.GpuIndex) &lt;SPAN class="hljs-keyword"&gt;else&lt;/SPAN&gt; index
faiss.write_index(cpu_index, &lt;SPAN class="hljs-string"&gt;"/dbfs/FileStore/semantic/faiss_index.faiss"&lt;/SPAN&gt;)

&lt;SPAN class="hljs-comment"&gt;# Load back&lt;/SPAN&gt;
index = faiss.read_index(&lt;SPAN class="hljs-string"&gt;"/dbfs/FileStore/semantic/faiss_index.faiss"&lt;/SPAN&gt;)&lt;/CODE&gt;&lt;/PRE&gt;
&lt;DIV class="go8b9g3 _7pq7t62w _7pq7t6ck _7pq7t6aw _7pq7t6bm"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;Keep any non-index artifacts (e.g., ID mapping or stats) in a separate file (pickle/JSON) next to your index if you aren’t using FAISS’s IDMap. Example pattern: &lt;CODE class="qt3gz9f"&gt;/Volume/myVolume/semantic/faiss_index.faiss&lt;/CODE&gt; +&lt;CODE class="qt3gz9f"&gt;/Volume/myVolume/&lt;/CODE&gt;&lt;CODE class="qt3gz9f"&gt;semantic/id_map.json&lt;/CODE&gt;. The I/O is efficient and simple to manage with notebook jobs on Databricks.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
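&lt;P class="qt3gz91 paragraph"&gt;For the case where you skip IndexIDMap, here is a minimal sketch of the side-file idea. It assumes a flat index that is only ever appended to, so the position FAISS returns equals the order in which vectors were added. The function names and the file name are hypothetical, and a temporary directory stands in for the Volume path.&lt;/P&gt;

```python
import json
import os
import tempfile


def load_id_map(path):
    """Return the list of external keys; list position = FAISS row position."""
    if not os.path.exists(path):
        return []
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def append_to_id_map(path, new_keys):
    """Append keys in the same order the vectors are passed to index.add()."""
    id_map = load_id_map(path)
    first_position = len(id_map)
    id_map.extend(new_keys)
    # Write to a temporary file first, then swap it in, so a failed job
    # does not leave a half-written map next to the index.
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(id_map, f)
    os.replace(tmp_path, path)
    return first_position


def resolve_hits(positions, id_map):
    """Translate positions returned by index.search() into external keys (-1 = no hit)."""
    return [id_map[p] for p in positions if p != -1]


# Usage with a throwaway directory standing in for the Volume path.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "id_map.json")
start_a = append_to_id_map(demo_path, ["doc-a", "doc-b"])
start_b = append_to_id_map(demo_path, ["doc-c"])
demo_hits = resolve_hits([2, 0, -1], load_id_map(demo_path))
```

&lt;P class="qt3gz91 paragraph"&gt;Once you start removing vectors, positions in a flat index no longer line up with this list, which is one reason the IndexIDMap route is usually less fragile.&lt;/P&gt;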
&lt;H3 class="_7uu25p0 qt3gz9c _7pq7t612 heading3 _7uu25p1"&gt;Safely adding embeddings incrementally to an existing FAISS index&lt;/H3&gt;
&lt;P class="qt3gz91 paragraph"&gt;There are two common cases:&lt;/P&gt;
&lt;UL class="qt3gz97 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;For &lt;STRONG&gt;Flat indexes&lt;/STRONG&gt; (e.g., IndexFlatL2/IndexFlatIP):&lt;/P&gt;
&lt;UL class="qt3gz98 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;You can append vectors with &lt;CODE class="qt3gz9f"&gt;index.add(new_vecs)&lt;/CODE&gt;.&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;If you need persistent, stable IDs for later lookups, wrap with **IndexIDMap** and use `add_with_ids` (IDs must be int64):
&lt;DIV class="go8b9g1 _7pq7t6cj" data-ui-element="code-block-container"&gt;
&lt;PRE&gt;&lt;CODE class="markdown-code-python qt3gz9e hljs language-python _1ymogdh2"&gt;&lt;SPAN class="hljs-keyword"&gt;import&lt;/SPAN&gt; faiss
&lt;SPAN class="hljs-keyword"&gt;import&lt;/SPAN&gt; numpy &lt;SPAN class="hljs-keyword"&gt;as&lt;/SPAN&gt; np

base = faiss.IndexFlatIP(dim)  &lt;SPAN class="hljs-comment"&gt;# or IndexFlatL2&lt;/SPAN&gt;
index = faiss.IndexIDMap(base)

new_vecs = np.asarray(new_embeddings, dtype=np.float32)
new_ids  = np.asarray(new_int64_ids, dtype=np.int64)
index.add_with_ids(new_vecs, new_ids)&lt;/CODE&gt;&lt;/PRE&gt;
&lt;DIV class="go8b9g3 _7pq7t62w _7pq7t6ck _7pq7t6aw _7pq7t6bm"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;To update existing vectors: remove the IDs you want to change and re-add the new embeddings with the same IDs:&lt;/P&gt;
&lt;DIV class="go8b9g1 _7pq7t6cj" data-ui-element="code-block-container"&gt;
&lt;PRE&gt;&lt;CODE class="markdown-code-python qt3gz9e hljs language-python _1ymogdh2"&gt;ids_to_update = np.asarray(ids_list, dtype=np.int64)
index.remove_ids(ids_to_update)
index.add_with_ids(updated_vecs, ids_to_update)&lt;/CODE&gt;&lt;/PRE&gt;
&lt;DIV class="go8b9g3 _7pq7t62w _7pq7t6ck _7pq7t6aw _7pq7t6bm"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;For &lt;STRONG&gt;IVF/PQ indexes&lt;/STRONG&gt; (e.g., IndexIVFPQ):&lt;/P&gt;
&lt;UL class="qt3gz98 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;Train once on a representative sample (you don’t need the full dataset), then you can &lt;STRONG&gt;add&lt;/STRONG&gt; new embeddings incrementally without retraining the index. This is the usual pattern for large datasets where training on all vectors isn’t feasible.
&lt;DIV class="_7pq7t614 _7pq7t6cj wrz27r2 wrz27r0"&gt;
&lt;DIV class="xh5urp3 xh5urp1 xh5urp0" role="presentation" aria-label="Citation 5"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;If your data distribution drifts significantly over time, you can periodically retrain with an updated representative sample and rebuild during a maintenance window, but the common approach is “train once, add many.”
&lt;DIV class="_7pq7t614 _7pq7t6cj wrz27r2 wrz27r0"&gt;
&lt;DIV class="xh5urp3 xh5urp1 xh5urp0" role="presentation" aria-label="Citation 5"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3 class="_7uu25p0 qt3gz9c _7pq7t612 heading3 _7uu25p1"&gt;Handling metadata alongside FAISS for retrieval&lt;/H3&gt;
&lt;UL class="qt3gz97 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;Store your metadata in a &lt;STRONG&gt;Delta table&lt;/STRONG&gt; governed by &lt;STRONG&gt;Unity Catalog&lt;/STRONG&gt;. Keep a stable primary key you also use as the FAISS ID:&lt;/P&gt;
&lt;UL class="qt3gz98 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;Table schema example: &lt;CODE class="qt3gz9f"&gt;id STRING&lt;/CODE&gt;, &lt;CODE class="qt3gz9f"&gt;text STRING&lt;/CODE&gt;, &lt;CODE class="qt3gz9f"&gt;embedding ARRAY&amp;lt;FLOAT&amp;gt;&lt;/CODE&gt;, plus any attributes for filtering.&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;When you do `index.search(...)` or `index.search`+IDMap lookups, use the returned IDs to join back to the Delta table:
&lt;DIV class="go8b9g1 _7pq7t6cj" data-ui-element="code-block-container"&gt;
&lt;PRE&gt;&lt;CODE class="markdown-code-python qt3gz9e hljs language-python _1ymogdh2"&gt;&lt;SPAN class="hljs-keyword"&gt;import&lt;/SPAN&gt; numpy &lt;SPAN class="hljs-keyword"&gt;as&lt;/SPAN&gt; np

&lt;SPAN class="hljs-comment"&gt;# Query&lt;/SPAN&gt;
q = np.asarray(query_emb, dtype=np.float32)
D, I = index.search(q.reshape(&lt;SPAN class="hljs-number"&gt;1&lt;/SPAN&gt;, -&lt;SPAN class="hljs-number"&gt;1&lt;/SPAN&gt;), top_k)
hit_ids = [&lt;SPAN class="hljs-built_in"&gt;int&lt;/SPAN&gt;(i) &lt;SPAN class="hljs-keyword"&gt;for&lt;/SPAN&gt; i &lt;SPAN class="hljs-keyword"&gt;in&lt;/SPAN&gt; I[&lt;SPAN class="hljs-number"&gt;0&lt;/SPAN&gt;] &lt;SPAN class="hljs-keyword"&gt;if&lt;/SPAN&gt; i != -&lt;SPAN class="hljs-number"&gt;1&lt;/SPAN&gt;]

&lt;SPAN class="hljs-comment"&gt;# Fetch metadata via Spark/Delta&lt;/SPAN&gt;
&lt;SPAN class="hljs-keyword"&gt;from&lt;/SPAN&gt; pyspark.sql &lt;SPAN class="hljs-keyword"&gt;import&lt;/SPAN&gt; functions &lt;SPAN class="hljs-keyword"&gt;as&lt;/SPAN&gt; F
hits_df = spark.table(&lt;SPAN class="hljs-string"&gt;"main.semantic.docs"&lt;/SPAN&gt;).where(F.col(&lt;SPAN class="hljs-string"&gt;"id"&lt;/SPAN&gt;).isin(hit_ids))
display(hits_df.orderBy(F.array_position(F.array(*[F.lit(x) &lt;SPAN class="hljs-keyword"&gt;for&lt;/SPAN&gt; x &lt;SPAN class="hljs-keyword"&gt;in&lt;/SPAN&gt; hit_ids]), F.col(&lt;SPAN class="hljs-string"&gt;"id"&lt;/SPAN&gt;))))&lt;/CODE&gt;&lt;/PRE&gt;
&lt;DIV class="go8b9g3 _7pq7t62w _7pq7t6ck _7pq7t6aw _7pq7t6bm"&gt;
&lt;DIV class="go8b9g5 _7pq7t6ch"&gt;python&lt;/DIV&gt;
&lt;DIV class="_17yk06p0"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;If you prefer to keep embeddings in Delta as well (for audits/versioning or hybrid pipelines), store them in &lt;CODE class="qt3gz9f"&gt;ARRAY&amp;lt;FLOAT&amp;gt;&lt;/CODE&gt;. FAISS still expects NumPy arrays at runtime, so load batches from Delta into memory when building/updating the index.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
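&lt;P class="qt3gz91 paragraph"&gt;If the result set is small enough to collect to the driver, the join-back step can also be done in plain Python. The sketch below assumes the rows have already been collected as dictionaries (for example via &lt;CODE class="qt3gz9f"&gt;[r.asDict() for r in hits_df.collect()]&lt;/CODE&gt;); the function name and the sample values are hypothetical.&lt;/P&gt;

```python
def rank_rows_by_hits(hit_ids, scores, rows, key="id"):
    """Return metadata rows in FAISS ranking order, each with its score attached."""
    rows_by_id = {row[key]: row for row in rows}
    ranked = []
    for hit_id, score in zip(hit_ids, scores):
        # FAISS pads missing results with -1; a row may also have been
        # deleted from the table since the index was built.
        if hit_id == -1 or hit_id not in rows_by_id:
            continue
        ranked.append({**rows_by_id[hit_id], "score": float(score)})
    return ranked


# Hypothetical search output (I[0], D[0]) and rows fetched in arbitrary order.
demo_hit_ids = [42, 7, -1]
demo_scores = [0.91, 0.83, 0.0]
demo_rows = [{"id": 7, "text": "second"}, {"id": 42, "text": "first"}]
ranked_rows = rank_rows_by_hits(demo_hit_ids, demo_scores, demo_rows)
```

&lt;P class="qt3gz91 paragraph"&gt;Keeping the score next to each row makes it easy to apply a similarity threshold after the metadata lookup.&lt;/P&gt;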
&lt;H3 class="_7uu25p0 qt3gz9c _7pq7t612 heading3 _7uu25p1"&gt;A scalable semantic search pipeline pattern on Databricks (FAISS)&lt;/H3&gt;
&lt;UL class="qt3gz97 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;Ingest data into a &lt;STRONG&gt;Delta table&lt;/STRONG&gt; with a stable &lt;STRONG&gt;primary key&lt;/STRONG&gt;.&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;Use a &lt;STRONG&gt;batch job&lt;/STRONG&gt; to compute embeddings (Spark UDF, Model Serving, or a Python batch on the driver) and write results to a Delta table with &lt;CODE class="qt3gz9f"&gt;id&lt;/CODE&gt; + &lt;CODE class="qt3gz9f"&gt;embedding&lt;/CODE&gt;.&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;Maintain a &lt;STRONG&gt;FAISS index artifact&lt;/STRONG&gt; in DBFS or a UC Volume:
&lt;UL class="qt3gz98 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;Initial build: read embeddings in batches, train if IVF/PQ, add vectors, then &lt;CODE class="qt3gz9f"&gt;write_index&lt;/CODE&gt;.&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;Incremental updates: detect new/changed rows (e.g., “last updated” column) and:
&lt;UL class="qt3gz99 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;For flat indexes: &lt;CODE class="qt3gz9f"&gt;add_with_ids&lt;/CODE&gt;.&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;For updates: &lt;CODE class="qt3gz9f"&gt;remove_ids&lt;/CODE&gt;, then &lt;CODE class="qt3gz9f"&gt;add_with_ids&lt;/CODE&gt;.&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;For IVF/PQ: ensure the index is trained; then &lt;CODE class="qt3gz9f"&gt;add&lt;/CODE&gt; new embeddings. Periodically evaluate drift and re-train if needed.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;Keep a small &lt;STRONG&gt;ID map file&lt;/STRONG&gt; (if you don’t use IndexIDMap) or just rely on IndexIDMap with int64 IDs to avoid external mapping.&lt;/LI&gt;
&lt;/UL&gt;
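&lt;P class="qt3gz91 paragraph"&gt;The incremental-update step can be sketched roughly as follows. This assumes each row carries a &lt;CODE class="qt3gz9f"&gt;last_updated&lt;/CODE&gt; timestamp and that the job keeps track of which IDs are already in the index; &lt;CODE class="qt3gz9f"&gt;plan_incremental_update&lt;/CODE&gt;, &lt;CODE class="qt3gz9f"&gt;indexed_ids&lt;/CODE&gt;, and the watermark handling are hypothetical names for illustration, not part of FAISS or Databricks.&lt;/P&gt;

```python
from datetime import datetime


def plan_incremental_update(rows, indexed_ids, watermark):
    """Split rows changed since `watermark` into new IDs and IDs needing replacement."""
    changed = [row for row in rows if row["last_updated"] > watermark]
    to_add = [row["id"] for row in changed if row["id"] not in indexed_ids]
    to_replace = [row["id"] for row in changed if row["id"] in indexed_ids]
    # The next run starts from the newest timestamp seen in this one.
    new_watermark = max([row["last_updated"] for row in changed], default=watermark)
    return to_add, to_replace, new_watermark


# Hypothetical rows read from the embeddings table.
demo_rows = [
    {"id": 1, "last_updated": datetime(2025, 1, 1)},
    {"id": 2, "last_updated": datetime(2025, 3, 1)},
    {"id": 3, "last_updated": datetime(2025, 3, 5)},
]
to_add, to_replace, new_watermark = plan_incremental_update(
    demo_rows, indexed_ids={1, 2}, watermark=datetime(2025, 2, 1)
)
```

&lt;P class="qt3gz91 paragraph"&gt;IDs in &lt;CODE class="qt3gz9f"&gt;to_replace&lt;/CODE&gt; would go through &lt;CODE class="qt3gz9f"&gt;remove_ids&lt;/CODE&gt; followed by &lt;CODE class="qt3gz9f"&gt;add_with_ids&lt;/CODE&gt;, while IDs in &lt;CODE class="qt3gz9f"&gt;to_add&lt;/CODE&gt; only need &lt;CODE class="qt3gz9f"&gt;add_with_ids&lt;/CODE&gt;.&lt;/P&gt;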
&lt;H3 class="_7uu25p0 qt3gz9c _7pq7t612 heading3 _7uu25p1"&gt;Consider Databricks Mosaic AI Vector Search (managed alternative)&lt;/H3&gt;
&lt;P class="qt3gz91 paragraph"&gt;If you’d prefer not to manage FAISS lifecycle, Databricks &lt;STRONG&gt;Mosaic AI Vector Search&lt;/STRONG&gt; provides a managed index that:&lt;/P&gt;
&lt;UL class="qt3gz97 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;Creates an index from a Delta table with automatic incremental sync (Delta Sync Index), or lets you write vectors directly (Direct Vector Access).&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;Stores and returns both vectors and associated metadata with &lt;STRONG&gt;Unity Catalog&lt;/STRONG&gt; governance, supports hybrid keyword+vector search, and scales to very large datasets (including new storage‑optimized endpoints).&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;You interact via SDK/REST/SQL (&lt;CODE class="qt3gz9f"&gt;vector_search()&lt;/CODE&gt;), and the service handles indexing, scaling, and syncing for you.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P class="qt3gz91 paragraph"&gt;Example (Python SDK) to query an index:&lt;/P&gt;
&lt;DIV class="go8b9g1 _7pq7t6cj" data-ui-element="code-block-container"&gt;
&lt;PRE&gt;&lt;CODE class="markdown-code-python qt3gz9e hljs language-python _1ymogdh2"&gt;%pip install databricks-vectorsearch
dbutils.library.restartPython()

&lt;SPAN class="hljs-keyword"&gt;from&lt;/SPAN&gt; databricks.vector_search.client &lt;SPAN class="hljs-keyword"&gt;import&lt;/SPAN&gt; VectorSearchClient
client = VectorSearchClient()
index = client.get_index(index_name=&lt;SPAN class="hljs-string"&gt;"main.semantic.my_index"&lt;/SPAN&gt;)

results = index.similarity_search(
    query_text=&lt;SPAN class="hljs-string"&gt;"my query"&lt;/SPAN&gt;,  &lt;SPAN class="hljs-comment"&gt;# or query_vector=[...]&lt;/SPAN&gt;
    num_results=&lt;SPAN class="hljs-number"&gt;5&lt;/SPAN&gt;,
    columns=[&lt;SPAN class="hljs-string"&gt;"id"&lt;/SPAN&gt;, &lt;SPAN class="hljs-string"&gt;"text"&lt;/SPAN&gt;, &lt;SPAN class="hljs-string"&gt;"tags"&lt;/SPAN&gt;, &lt;SPAN class="hljs-string"&gt;"date"&lt;/SPAN&gt;]
)&lt;/CODE&gt;&lt;/PRE&gt;
&lt;/DIV&gt;
&lt;H3 class="_7uu25p0 qt3gz9c _7pq7t612 heading3 _7uu25p1"&gt;Quick checklist&lt;/H3&gt;
&lt;UL class="qt3gz97 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;Use &lt;STRONG&gt;IndexIDMap&lt;/STRONG&gt; and &lt;STRONG&gt;int64 IDs&lt;/STRONG&gt; to align FAISS results with Delta metadata.&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;For &lt;STRONG&gt;IVF/PQ&lt;/STRONG&gt;: &lt;STRONG&gt;train once&lt;/STRONG&gt; on representative data, then &lt;STRONG&gt;add incrementally&lt;/STRONG&gt;; plan occasional retraining if data distribution shifts.&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;Persist indexes with &lt;STRONG&gt;faiss.write_index/read_index&lt;/STRONG&gt;; convert GPU ↔ CPU when saving/loading.&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;Keep metadata in &lt;STRONG&gt;Delta&lt;/STRONG&gt;, governed by &lt;STRONG&gt;Unity Catalog&lt;/STRONG&gt;, and join by IDs for retrieval.&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;For hands-off scaling and incremental sync, evaluate &lt;STRONG&gt;Mosaic AI Vector Search&lt;/STRONG&gt; instead of DIY FAISS.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Hope this helps, Louis.&lt;/P&gt;</description>
    <pubDate>Thu, 13 Nov 2025 14:31:26 GMT</pubDate>
    <dc:creator>Louis_Frolio</dc:creator>
    <dc:date>2025-11-13T14:31:26Z</dc:date>
    <item>
      <title>How to store &amp; update a FAISS Index in Databricks</title>
      <link>https://community.databricks.com/t5/machine-learning/how-to-store-amp-update-a-faiss-index-in-databricks/m-p/138918#M4439</link>
      <description>&lt;P&gt;I’m currently using FAISS in a Databricks notebook to perform semantic search in text data. My current workflow looks like this:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Encode ~10k text entries using an embedding model.&lt;/LI&gt;&lt;LI&gt;Build a FAISS index in memory.&lt;/LI&gt;&lt;LI&gt;Run similarity searches using index.search().&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;This works fine for 10k texts, but as the dataset grows, rebuilding the FAISS index every time is becoming slow. I’d like to incrementally add new embeddings to the existing index instead of rebuilding from scratch.&lt;/P&gt;&lt;P&gt;Here’s a simplified snippet of what I’m doing now:&lt;/P&gt;&lt;PRE&gt;import faiss
import numpy as np

query_emb = embed_model.encode([query_text], normalize_embeddings=True)
query_emb = np.array(query_emb, dtype=np.float32)
faiss.normalize_L2(query_emb)
distances, indices = index.search(query_emb, top_N)&lt;/PRE&gt;&lt;UL class="lia-list-style-type-square"&gt;&lt;LI&gt;What is the best practice in Databricks to save/load FAISS indexes efficiently (preferably on DBFS or Delta)?&lt;/LI&gt;&lt;LI&gt;How can I safely add new embeddings incrementally to an existing FAISS index?&lt;/LI&gt;&lt;LI&gt;How do others handle metadata (like text or IDs) alongside the FAISS index for retrieval?&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;I’d be glad to have guidance or example patterns for a scalable semantic search pipeline in Databricks using FAISS.&lt;/P&gt;&lt;P&gt;#FAISS #SemanticSearch #VectorSearch&lt;/P&gt;</description>
      <pubDate>Thu, 13 Nov 2025 12:33:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/how-to-store-amp-update-a-faiss-index-in-databricks/m-p/138918#M4439</guid>
      <dc:creator>ashfire</dc:creator>
      <dc:date>2025-11-13T12:33:45Z</dc:date>
    </item>
    <item>
      <title>Re: How to store &amp; update a FAISS Index in Databricks</title>
      <link>https://community.databricks.com/t5/machine-learning/how-to-store-amp-update-a-faiss-index-in-databricks/m-p/138938#M4440</link>
      <pubDate>Thu, 13 Nov 2025 14:31:26 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/how-to-store-amp-update-a-faiss-index-in-databricks/m-p/138938#M4440</guid>
      <dc:creator>Louis_Frolio</dc:creator>
      <dc:date>2025-11-13T14:31:26Z</dc:date>
    </item>
  </channel>
</rss>

