Databricks Community

620139 · ‎01-01-2025

I have been able to successfully implement a Databricks vector search index with metadata filtering (How to create and query a vector search index | Databricks on AWS).

However, I am facing a challenge when implementing a more advanced filtering mechanism.

In my setup, I have a metadata column in the index that contains an array of strings. I need to create a filter that identifies matches based on the intersection between an input array and the index array. Specifically, a match should occur if the intersection returns at least one common value.
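As a minimal sketch of the matching rule described above, written in plain Python rather than as a vector search filter (the names `has_overlap`, `row_tags`, and `query_tags` are illustrative and not part of any Databricks API):

```python
def has_overlap(row_tags, query_tags):
    """Return True if the two collections share at least one value."""
    return not set(row_tags).isdisjoint(query_tags)

# A row tagged ["finance", "legal"] should match a query for ["legal", "hr"] ...
print(has_overlap(["finance", "legal"], ["legal", "hr"]))
# ... but not a query for ["hr", "ops"].
print(has_overlap(["finance", "legal"], ["hr", "ops"]))
```

The goal is to have the index apply this predicate during the search, not to run it on the client afterwards.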

I don't see a straightforward way to do this with the existing Databricks vector search filter options.

Thanks for any advice!

Walter_C · ‎01-02-2025

Currently, the Databricks vector search filter options do not directly support filtering based on the intersection of arrays.

620139 · ‎01-02-2025

Yes, I see that...

Are there any known workarounds? Some combination of existing filters, or a code customization? This does not seem to be an uncommon search pattern...

Walter_C · ‎01-02-2025

Here is an example of how you can implement this in Python:


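# Step 1: Retrieve the data (over-fetch, because the filtering happens after the search)
response = index.similarity_search(query_text="your_query", columns=["id", "metadata_column"], num_results=100)

# The Python client returns a dict, not a list of rows, so convert it to row dicts.
# Assumption: the response has the shape
# {"manifest": {"columns": [...]}, "result": {"data_array": [...]}}; adjust if yours differs.
column_names = [col["name"] for col in response["manifest"]["columns"]]
results = [dict(zip(column_names, row)) for row in response["result"].get("data_array", [])]

# Step 2: Define the custom filtering function
def filter_by_intersection(results, input_array):
    wanted = set(input_array)
    # Keep rows whose metadata array shares at least one value with input_array.
    # Assumes metadata_column comes back as a list; parse it first if it is a string.
    return [row for row in results if wanted.intersection(row["metadata_column"] or [])]

# Step 3: Apply the custom filtering function
input_array = ["value1", "value2", "value3"]
filtered_results = filter_by_intersection(results, input_array)

# filtered_results now contains only the entries where the intersection is non-empty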

These steps keep only the rows whose metadata array shares at least one value with the input array. Because the filter runs after the search, it uses the existing Databricks vector search capabilities unchanged and applies the custom logic on the client side.

620139 · ‎01-02-2025

The above solution is effectively a post-search filter, which would reduce the number of results returned. I am looking for a solution that performs the filtering on the index itself.
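This does not move the filtering into the index, but the reduced result count can be softened by over-fetching until enough rows survive the filter. The sketch below is only an illustration under stated assumptions: `search_fn` is a hypothetical callable that returns the top-n rows in ranked order (for example, a thin wrapper around a similarity search), and `start` and `max_fetch` are arbitrary values, not documented limits.

```python
def search_with_post_filter(search_fn, keep_fn, k, start=50, max_fetch=1000):
    """Over-fetch from search_fn until k rows pass keep_fn or max_fetch is reached."""
    n = start
    while True:
        rows = search_fn(n)
        kept = [row for row in rows if keep_fn(row)]
        # Stop when there are enough matches, the index is exhausted, or the cap is hit
        if len(kept) >= k or len(rows) < n or n >= max_fetch:
            return kept[:k]
        n = min(n * 2, max_fetch)

# Stand-in for an index: 500 ranked rows, every tenth one tagged "a"
fake_index = [{"id": i, "tags": ["a"] if i % 10 == 0 else ["b"]} for i in range(500)]

def fake_search(n):
    return fake_index[:n]

wanted = {"a"}
top = search_with_post_filter(fake_search, lambda row: wanted.intersection(row["tags"]), k=10)
print([row["id"] for row in top])
```

Each retry repeats the whole query, so this trades extra latency for a fuller result set. It remains a post-search filter.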

txti · ‎01-14-2025

Hi. You can apply a filter on any metadata field in the index.
See the "Use filters on queries" section here: How to create and query a vector search index | Databricks on AWS
The JSON filter syntax takes some getting used to but is flexible. Here's a snippet that shows how to do this:

SEARCH_FILTER = {
    "language": "English",
    "source_types LIKE": "News",
}

# Limit to English News publications (vs is the vector search index object)
results = vs.similarity_search(
    query_text="Articles that discuss GenAI ethics",
    filters=SEARCH_FILTER,
    num_results=4,
)

620139 · ‎01-15-2025

There is no filter operator based on the intersection of arrays.

txti · ‎01-16-2025

I see, I did not read your question carefully enough.
If I now understand your requirement correctly, this syntax (from docs) should do the trick:

Filter operator: none specified
Behavior: the filter checks for an exact match. If multiple values are specified, it matches any of the values.
Examples: {"id": 200} and {"id": [200, 300]}

Walter_C · ‎01-02-2025

Allow me to look further and see if there is any additional approach.

Walter_C · ‎01-03-2025

Unfortunately, I was not able to find any way around that beyond the solution proposed above.

620139 · ‎01-03-2025

Thanks for your help...

Walter_C · ‎01-03-2025

Sure, happy to help. Let us know if you have additional questions.

Help with Databricks vector search index advanced metadata filtering
