iyashk-DB
Databricks Employee

Hi @deane ,

I worked on a similar issue a few months ago. The limit of 1024 is not configurable (there is no explicit way to increase the filter count), so my suggestion back then was a two-step process: filter on one half of the values first, then on the other half. What happened internally at the time was that, when the filter length exceeded the limit, we switched back to ANN without causing the queries to fail. To answer your questions:

  1. Is this batching approach reasonable for large filters?
    Yes, batching is a reasonable and commonly used pattern when your filter list exceeds the query API's limit of 1,024 elements per filter clause.
    If you're using the SQL-like filter string (storage-optimized endpoints), you can split your IDs across multiple clauses in a single query, for example: id IN (...) OR id IN (...), keeping each IN list within 1,024 items. This needs fewer round trips than issuing entirely separate queries, subject to practical performance and expression size constraints.
    Databricks’ filtering is applied in the query itself (not as a post-filter), so the engine optimizes execution considering your filter scope.

  2. Can I rely on the ANN + HNSW algorithm to return consistent similarity scores for the same query vector, regardless of which other IDs are included in the filter? Or could the results vary depending on the set of IDs passed in each query?
    As I mentioned, we fall back to ANN (to keep similar performance and retrieval quality). Prefer ANN-only queries (query_type=ANN) when you need to compare raw vector similarity across batches; combine the batch results by the returned distance/score and then take a global top-K.
    If you need hybrid ranking, run hybrid per batch, union results, and optionally do a secondary global rerank (for example, vector-only or your own application-level criteria) so the final order is consistent across the combined pool.
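
The two steps above (splitting IDs into IN lists of at most 1,024 items, then merging per-batch results into a global top-K) could look roughly like this in Python. This is a minimal sketch and not the Vector Search client API: `build_filter` and `merge_top_k` are hypothetical helper names, the IDs are assumed to be integers (string IDs would need quoting), and each batch result is assumed to be a list of `(id, score)` pairs. Whether a higher or lower value is better depends on what your endpoint returns, so that is a parameter.

```python
import heapq
from typing import Iterable, List, Sequence, Tuple

MAX_IN_LIST = 1024  # per-clause limit discussed above


def build_filter(ids: Sequence[int], column: str = "id",
                 max_per_clause: int = MAX_IN_LIST) -> str:
    """Build 'col IN (...) OR col IN (...)' with each IN list within the limit.

    Assumes integer IDs; string IDs would need quoting and escaping.
    """
    if not ids:
        raise ValueError("ids must not be empty")
    clauses = []
    for start in range(0, len(ids), max_per_clause):
        chunk = ids[start:start + max_per_clause]
        values = ", ".join(str(i) for i in chunk)
        clauses.append(f"{column} IN ({values})")
    return " OR ".join(clauses)


def merge_top_k(batches: Iterable[List[Tuple[int, float]]], k: int,
                higher_is_better: bool = True) -> List[Tuple[int, float]]:
    """Union per-batch (id, score) results and return the global top-K.

    If an id appears in more than one batch, only its best score is kept.
    """
    best = {}
    for batch in batches:
        for doc_id, score in batch:
            prev = best.get(doc_id)
            is_better = prev is None or (
                score > prev if higher_is_better else score < prev
            )
            if is_better:
                best[doc_id] = score
    pick = heapq.nlargest if higher_is_better else heapq.nsmallest
    return pick(k, best.items(), key=lambda item: item[1])


# Example with made-up scores from two batches
batch_a = [(1, 0.91), (2, 0.40)]
batch_b = [(3, 0.75), (4, 0.10)]
top_two = merge_top_k([batch_a, batch_b], k=2)
```

Here `top_two` holds the two best hits across both batches. For distance-style scores where smaller means closer, pass `higher_is_better=False`.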