Generative AI
Explore discussions on generative artificial intelligence techniques and applications within the Databricks Community. Share ideas, challenges, and breakthroughs in this cutting-edge field.

Help with Databricks vector search index advanced metadata filtering

620139
New Contributor III

I have been able to successfully implement a Databricks vector search index with metadata filtering (How to create and query a vector search index | Databricks on AWS).

However, I am facing a challenge when implementing a more advanced filtering mechanism.

In my setup, I have a metadata column in the index that contains an array of strings. I need to create a filter that identifies matches based on the intersection between an input array and the index array. Specifically, a match should occur if the intersection returns at least one common value.
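To make the matching rule concrete, here is a minimal plain-Python sketch of the predicate described above. It is not a Vector Search filter; `matches` and the sample values are hypothetical names used only for illustration.

```python
def matches(input_values, indexed_values):
    """Return True when the two arrays share at least one value."""
    return not set(input_values).isdisjoint(indexed_values)


# Hypothetical sample data: one input array checked against two indexed rows.
print(matches(["a", "b"], ["b", "c"]))  # the arrays share "b"
print(matches(["a", "b"], ["c", "d"]))  # the arrays have nothing in common
```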

I don't see a straightforward way to do this with the existing Databricks vector search filter options.

Thanks for any advice!

8 REPLIES

Walter_C
Databricks Employee

Currently, the Databricks vector search filter options do not directly support filtering based on the intersection of arrays.

620139
New Contributor III

Yes, I see that...

Are there any known workarounds, such as a combination of existing filters or a code customization? This does not seem to be an uncommon search pattern.

Walter_C
Databricks Employee

One workaround is to retrieve more results than you need and filter them in Python after the search. Here is an example:


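# Step 1: Retrieve the data (over-fetch, since filtering happens after the search)
results = index.similarity_search(query_text="your_query", columns=["id", "metadata_column"], num_results=100)

# similarity_search returns a dict, not a list of rows: column names are under
# results["manifest"]["columns"] and row values under results["result"]["data_array"]
column_names = [col["name"] for col in results["manifest"]["columns"]]
rows = [dict(zip(column_names, row)) for row in results["result"].get("data_array", [])]

# Step 2: Define the custom filtering function
def filter_by_intersection(rows, input_array):
    # Assumes metadata_column comes back as a list of strings
    wanted = set(input_array)
    filtered_results = []
    for row in rows:
        metadata_array = row["metadata_column"] or []
        if any(item in wanted for item in metadata_array):
            filtered_results.append(row)
    return filtered_results

# Step 3: Apply the custom filtering function
input_array = ["value1", "value2", "value3"]
filtered_results = filter_by_intersection(rows, input_array)

# filtered_results now contains only the rows where the intersection is non-empty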

With these steps you get intersection-based filtering on top of the existing Databricks vector search results, using custom logic applied after the query returns.

620139
New Contributor III

The above solution is effectively a post-search filter, which would reduce the number of results returned. I am looking for a solution that performs the filtering on the index itself.
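The shortfall described here can be illustrated with a small sketch. It assumes a hypothetical `search(num_results)` callable that returns ranked rows as dicts with a `metadata_column` list (a stand-in for whatever wraps the real index query), and it keeps requesting more candidates until `k` rows survive the intersection filter or a cap is reached. This only mitigates the problem; the filter still runs after the search, not on the index.

```python
def top_k_with_intersection(search, input_array, k, max_candidates=1000):
    """Over-fetch candidates until k rows pass the intersection filter.

    `search` is a hypothetical callable: search(num_results) -> list of dicts,
    ranked by similarity, each with a "metadata_column" list.
    """
    wanted = set(input_array)
    num_results = k
    while True:
        rows = search(num_results)
        kept = [r for r in rows if wanted.intersection(r["metadata_column"])]
        # Stop when enough rows survive, the source is exhausted, or the cap is hit.
        if len(kept) >= k or len(rows) < num_results or num_results >= max_candidates:
            return kept[:k]
        num_results = min(num_results * 2, max_candidates)


# Stand-in for a real index query: a fixed, already-ranked list of rows.
fake_rows = [
    {"id": 1, "metadata_column": ["x"]},
    {"id": 2, "metadata_column": ["value1"]},
    {"id": 3, "metadata_column": ["y"]},
    {"id": 4, "metadata_column": ["value2", "z"]},
]


def fake_search(num_results):
    return fake_rows[:num_results]


top = top_k_with_intersection(fake_search, ["value1", "value2"], k=2)
print([r["id"] for r in top])
```

A plain top-2 query followed by the filter would keep only one row here; doubling the candidate count recovers the second match at the cost of extra queries.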

Walter_C
Databricks Employee

Allow me to look further and see if there is any additional approach.

Walter_C
Databricks Employee

Unfortunately, I was not able to find any alternative to the post-search filtering approach proposed above.

620139
New Contributor III

Thanks for your help.

Walter_C
Databricks Employee

Sure, happy to help. Let us know if you have additional questions.
