Generative AI
Explore discussions on generative artificial intelligence techniques and applications within the Databricks Community. Share ideas, challenges, and breakthroughs in this cutting-edge field.

Databricks Vector Search Algorithm

deane
Visitor

I’m working on using Databricks vector search combined with full-text search in my application. I want to filter queries by the id field in my vector search index. I noticed that there is a limit of 1,024 IDs per query when using filters.

If I need to filter on more than 1,024 IDs, my current idea is to run multiple queries in batches and then combine the results.

My questions are:

  1. Is this batching approach reasonable for large filters?

  2. Can I rely on the ANN + HNSW algorithm to return consistent similarity scores for the same query vector, regardless of which other IDs are included in the filter? Or could the results vary depending on the set of IDs passed in each query?

Thanks in advance for any insights!

1 REPLY 1

iyashk-DB
Databricks Employee

Hi @deane ,

I worked on a similar issue a few months ago. The limit of 1,024 is not configurable (there is no explicit way to increase the filter count), so my suggestion back then was a two-step process: filter on the first half of the IDs in one query and on the second half in the next. What was happening internally at the time was that, when the filter length exceeded the limit, we switched back to ANN instead of failing the query. So, to answer your questions:

  1. Is this batching approach reasonable for large filters?
    Yes, batching is a reasonable and commonly used pattern when your filter list exceeds the 1,024 elements per filter clause limit in the query API.
    If you’re using the SQL-like filter string (storage-optimized endpoints), you can split your IDs across multiple clauses in a single query, for example: id IN (...) OR id IN (...), keeping each IN list within 1,024 items (the same splitting idea I mentioned earlier). This reduces the number of round trips compared with issuing entirely separate queries, subject to practical performance and expression size constraints.
    Databricks’ filtering is applied in the query itself (not as a post-filter), so the engine optimizes execution considering your filter scope.

  2. Can I rely on the ANN + HNSW algorithm to return consistent similarity scores for the same query vector, regardless of which other IDs are included in the filter? Or could the results vary depending on the set of IDs passed in each query?
    As I said above, we fall back to ANN (to keep similar performance and retrieval quality). Prefer ANN-only queries (query_type=ANN) when you need to compare raw vector similarity across batches; combine the batch results by the returned distance/score and then take a global top-K.
    If you need hybrid ranking, run hybrid per batch, union results, and optionally do a secondary global rerank (for example, vector-only or your own application-level criteria) so the final order is consistent across the combined pool.
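
To make the two answers above concrete, here is a minimal sketch in plain Python of the client-side pieces: chunking the IDs, building the id IN (...) OR id IN (...) filter string, and merging per-batch results into a global top-K. It does not call the Vector Search client at all. The helper names (chunked, sql_literal, build_id_filter, merge_top_k) are hypothetical, the quoting of string IDs is an assumption about the filter syntax, and the merge assumes each result row can be reduced to an (id, score) pair where a higher score means more similar. Adapt it to whatever your queries actually return.

```python
import heapq
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

MAX_IDS_PER_CLAUSE = 1024  # per-clause limit discussed in this thread


def chunked(ids: Iterable, size: int = MAX_IDS_PER_CLAUSE) -> Iterator[list]:
    """Yield successive lists of at most `size` IDs."""
    it = iter(ids)
    while batch := list(islice(it, size)):
        yield batch


def sql_literal(value) -> str:
    """Render one ID for the filter string.

    Assumption: string IDs are single-quoted with embedded quotes doubled;
    numeric IDs are written as-is. Check this against your index's syntax.
    """
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    return str(value)


def build_id_filter(ids: Iterable, column: str = "id",
                    size: int = MAX_IDS_PER_CLAUSE) -> str:
    """Build `id IN (...) OR id IN (...)`, each IN list holding <= `size` IDs."""
    clauses = [
        f"{column} IN ({', '.join(sql_literal(v) for v in batch)})"
        for batch in chunked(ids, size)
    ]
    return " OR ".join(clauses)


def merge_top_k(batches: Iterable[List[Tuple[object, float]]],
                k: int) -> List[Tuple[object, float]]:
    """Combine per-batch (id, score) rows and return the global top-k.

    Assumption: higher score means more similar. If an ID shows up in more
    than one batch, only its best score is kept.
    """
    best = {}
    for rows in batches:
        for row_id, score in rows:
            if row_id not in best or score > best[row_id]:
                best[row_id] = score
    return heapq.nlargest(k, best.items(), key=lambda item: item[1])


# Small demonstration with a tiny chunk size instead of 1,024.
print(build_id_filter([1, 2, 3], size=2))  # id IN (1, 2) OR id IN (3)
print(merge_top_k([[("a", 0.9), ("b", 0.5)], [("c", 0.7)]], k=2))
```

If you issue separate queries instead of one OR-ed filter, pass each query's rows as one element of batches. Merging by raw score like this only makes sense when the scores are comparable across batches, which is why the ANN-only route is the safer one for this step.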