Databricks Community

priyam39 · ‎05-26-2025

I have an issue. I created a Delta table and a vector search index on top of it. For a particular query, a similarity search sometimes returns documents and sometimes returns none.

For example

# query="Explain about the content of R-CarAI_ACC_UM_172_RWDT_r0p30.xlsx"

# query="Explain about the content of AI"

query="Explain about the content of 022_R-CarAI_acc_IPEX_Interrupt_table_r0p30.xlsx"

# query="Explain about the tab named register description of R-CarAI-ACC_UM_172_RWDT_r0p30.xlsx"

# docs=vector_search.similarity_search(query_text=query,num_results=10,columns=["file_name","page_info"])

docs=vector_search.similarity_search(query_text=query,num_results=10,columns=["content_type", "content", "summary" , "file_name"],filters={"file_name": ["022_R-CarAI_acc_IPEX_Interrupt_table_r0p30.xlsx"]})

Sometimes this works and returns the Excel files, and sometimes it returns no documents. Do you have any idea why the vector search index behaves this inconsistently?

mark_ott · ‎11-05-2025

This inconsistent behavior in your Delta Table and vector search index is a common issue with semantic vector searches, especially when working with diverse or structured data like Excel file contents. There are several likely causes for why your similarity search sometimes returns documents and other times returns none, even with what appear to be similar queries:

1. Query Embedding and Matching

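If your vector search index relies on text embeddings (such as from models like OpenAI, Hugging Face, etc.), small differences in query phrasing or terminology can result in very different embeddings, thus impacting similarity results. Note that this explains differences between query variants; it does not by itself explain different results for the exact same query text.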
Long or complex queries may produce embeddings that are less similar to any index entries, especially if the index was built using shorter, fragmented content chunks.
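To make the idea of embedding similarity concrete, here is a minimal sketch of a cosine similarity comparison between a query vector and chunk vectors. The vectors are made-up toy values, not output from any real embedding model, and cosine_similarity is a helper defined here for illustration only.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity is undefined for a zero vector; treat as no match
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (hypothetical values for illustration only).
query_vec = [1.0, 0.0, 0.0]
chunk_vecs = {
    "chunk_a": [1.0, 0.0, 0.0],  # same direction as the query
    "chunk_b": [0.0, 1.0, 0.0],  # orthogonal to the query
}
scores = {name: cosine_similarity(query_vec, vec) for name, vec in chunk_vecs.items()}
```

With real embeddings, a query whose score against every chunk of the target file is low would point to an embedding or chunking mismatch rather than a filter problem.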

2. Chunking and Indexing Strategy

If you chunked your Excel files by row, page, or cell, the context available to the embedding may not be sufficient to match against high-level or document-level queries, causing matches to occasionally be missed.
Make sure that your chunk size and overlap settings when creating the delta table are tuned to balance between context and specificity.
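As a rough illustration of chunk size and overlap, the following sketch splits text into fixed-size character windows that share some characters with their neighbours. The function name chunk_text and its default values are assumptions chosen for this example; they are not taken from your pipeline or from any Databricks API.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into character windows of chunk_size that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

# Small demo: windows of 4 characters, each sharing 2 with the previous one.
demo_chunks = chunk_text("abcdefghij", chunk_size=4, overlap=2)
```

Larger chunks keep more context per embedding, while more overlap reduces the chance that a relevant passage is split across two chunks.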

3. Filtering Logic

Your query is sometimes using filters like:

filters={"file_name": ["022_R-CarAI_acc_IPEX_Interrupt_table_r0p30.xlsx"]}

If the actual file_name value in the index has extra spaces, different casing, or a slightly different spelling, your filter may not match any records, even if the content exists.
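One way to check for this is to compare the filter value against the distinct file_name values stored in the source table. The sketch below assumes you have already collected those values into a Python list (how you fetch them is up to you); normalize_name and diagnose_filter are hypothetical helpers, and the sample names are invented for the demo.

```python
def normalize_name(name):
    """Collapse runs of whitespace, trim the ends, and ignore case."""
    return " ".join(name.split()).casefold()

def diagnose_filter(wanted, indexed_names):
    """Classify a filter value as an exact match, a near match, or missing."""
    if wanted in indexed_names:
        return ("exact", [wanted])
    near = [n for n in indexed_names if normalize_name(n) == normalize_name(wanted)]
    if near:
        return ("near", near)  # differs only by whitespace or casing
    return ("missing", [])

# Invented sample: the stored name differs in casing and has a trailing space.
indexed = [
    "022_R-CarAI_ACC_IPEX_Interrupt_table_r0p30.xlsx ",
    "R-CarAI_ACC_UM_172_RWDT_r0p30.xlsx",
]
status, candidates = diagnose_filter(
    "022_R-CarAI_acc_IPEX_Interrupt_table_r0p30.xlsx", indexed
)
```

A "near" result means the file is present but the filter string needs to be corrected to the stored value, or the stored values need to be cleaned before indexing.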

4. Data Refresh or Index Inconsistency

If the delta table or vector index is being updated or re-indexed frequently, there is a risk that the vector store and the delta table can get temporarily out of sync. This would cause some searches to intermittently fail.

5. Vector Search Thresholds

Some vector search libraries apply a minimum similarity cutoff: if no score exceeds the threshold, the search returns no results. Some libraries let you adjust this threshold or fall back to returning the top-k results regardless of the score. Check whether a score threshold is set anywhere in your retrieval code or configuration.
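The effect of such a cutoff, and of a top-k fallback, can be sketched in plain Python. This is only an illustration of the logic, not the behaviour of any specific library: filter_by_score, min_score, and fallback_k are hypothetical names, and the scores are invented.

```python
def filter_by_score(results, min_score, fallback_k=0):
    """Keep (doc_id, score) pairs with score >= min_score, best first.

    If nothing passes the cutoff and fallback_k > 0, return the top
    fallback_k results anyway instead of an empty list.
    """
    ranked = sorted(results, key=lambda r: r[1], reverse=True)
    kept = [r for r in ranked if r[1] >= min_score]
    if not kept and fallback_k > 0:
        return ranked[:fallback_k]
    return kept

# Invented scores where a higher value means a closer match.
sample = [("a", 0.42), ("b", 0.55), ("c", 0.31)]
strict = filter_by_score(sample, min_score=0.6)                    # nothing passes
with_fallback = filter_by_score(sample, min_score=0.6, fallback_k=2)
```

If a strict cutoff like this sits between your index and your application, a query whose best score lands just below it would look like "no documents" even though close matches exist.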

6. Tokenization/Parsing Anomalies

If the Excel file is parsed differently in each run or different libraries are used (e.g., openpyxl vs. pandas), it could result in slightly different string content in the indexed chunks.

Recommendations to Fix or Investigate Further

Consistency in Filtering: Double-check that your file names, tab names, and filters are exactly as indexed (no extra spaces or case mismatches).
Adjust Chunking: Try larger or overlapping text chunks to preserve context, especially for content derived from tables or structured documents.
Check Embeddings: Compare the generated query embedding to the indexed chunk embeddings; ensure their cosine similarity is reasonable.
Lower Similarity Threshold: If possible, lower or disable any minimum similarity thresholds in the vector search.
Logs and Debugging: Log the queries and returned results from the vector store for analysis.
Re-index Carefully: Ensure that reindexing and updates are atomic, and avoid serving queries while an update is only partially applied.
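For the logging suggestion, a small wrapper around the search call can record each query and how many results came back, and repeating the identical call a few times shows how often it returns nothing. This is a sketch under assumptions: search_fn stands for whatever function you call, it is assumed to return a list, and the real client's response shape may differ, so the len(results) line may need adapting. fake_search is a stand-in used only for the demo.

```python
import logging

logger = logging.getLogger("vector_search_debug")

def logged_search(search_fn, **kwargs):
    """Call search_fn(**kwargs) and log the arguments and the number of hits."""
    results = search_fn(**kwargs)
    count = len(results)  # assumes a list-like response
    level = logging.WARNING if count == 0 else logging.INFO
    logger.log(level, "search kwargs=%r returned %d result(s)", kwargs, count)
    return results

def empty_rate(search_fn, runs=5, **kwargs):
    """Fraction of repeated identical calls that come back empty."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    empties = sum(
        1 for _ in range(runs) if len(logged_search(search_fn, **kwargs)) == 0
    )
    return empties / runs

# Stand-in search function (hypothetical) that returns a hit only on even calls.
calls = {"n": 0}

def fake_search(query_text, num_results=10, filters=None):
    calls["n"] += 1
    return ["doc"] if calls["n"] % 2 == 0 else []

rate = empty_rate(fake_search, runs=4, query_text="Explain about the content of AI")
```

A non-zero empty rate for an identical query and filter suggests the cause is on the index side (for example a sync in progress) rather than in the query wording.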

This intermittent retrieval generally points to a mismatch either at the filtering level or due to vector similarity settings and query embeddings. Careful review of your chunking, filters, and retrieval configuration should resolve the issue.