Wednesday - last edited Thursday
I have been able to successfully implement a Databricks vector search index with metadata filtering (How to create and query a vector search index | Databricks on AWS).
However, I am facing a challenge when implementing a more advanced filtering mechanism.
In my setup, I have a metadata column in the index that contains an array of strings. I need to create a filter that identifies matches based on the intersection between an input array and the index array. Specifically, a match should occur if the intersection returns at least one common value.
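To make the intended rule concrete, here is a minimal sketch in plain Python. The function name `arrays_overlap` is only an illustration and is not part of any Databricks API; it just states the condition the filter should evaluate for each indexed row.

```python
from typing import Iterable


def arrays_overlap(input_array: Iterable[str], index_array: Iterable[str]) -> bool:
    """Return True when the two arrays share at least one value."""
    # isdisjoint() is True when there is no common element, so negate it.
    return not set(input_array).isdisjoint(index_array)


print(arrays_overlap(["a", "b"], ["b", "c"]))  # True: "b" is in both
print(arrays_overlap(["a"], ["c", "d"]))       # False: nothing in common
```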
I don't see a straightforward way to do this with the existing Databricks vector search filter options.
Thanks for any advice!
Thursday
Currently, the Databricks vector search filter options do not directly support filtering based on the intersection of arrays.
Thursday
Yes, I see that...
Are there any known workarounds? Some combination of existing filters or a code customization? This does not seem to be an uncommon search pattern...
Thursday
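As a workaround, you can retrieve the results first and filter them afterwards in Python. Here is an example:
# Step 1: Retrieve the data
results = index.similarity_search(query_text="your_query", columns=["id", "metadata_column"], num_results=100)
# similarity_search returns a dict; turn its rows into dicts keyed by column name
column_names = [col["name"] for col in results["manifest"]["columns"]]
rows = [dict(zip(column_names, row)) for row in results["result"]["data_array"]]
# Step 2: Define the custom filtering function
def filter_by_intersection(rows, input_array):
    input_set = set(input_array)  # set lookup avoids rescanning the list per item
    filtered_results = []
    for row in rows:
        # Assumes the array column comes back as a Python list (or None)
        metadata_array = row["metadata_column"] or []
        if any(item in input_set for item in metadata_array):
            filtered_results.append(row)
    return filtered_results
# Step 3: Apply the custom filtering function
input_array = ["value1", "value2", "value3"]
filtered_results = filter_by_intersection(rows, input_array)
# The filtered_results now contain only the entries where the intersection is non-empty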
With these steps, a result is kept only when the input array and its metadata array share at least one value. This builds on the existing Databricks vector search capabilities and adds the custom matching logic in your own code.
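Because this filter runs on rows the search has already returned, it can leave fewer matches than you asked for. One way to compensate is to over-fetch and retry with a larger `num_results` until enough rows survive. The sketch below is only an illustration under stated assumptions: `rows_from_response`, `search_with_overlap`, `start`, and `limit` are hypothetical names, the response is assumed to have the `manifest` / `result.data_array` layout, and the array column is assumed to come back as a Python list. It still filters after the search; it does not push the filter into the index.

```python
from typing import Any, Callable, Dict, Iterable, List


def rows_from_response(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn a similarity_search-style response into a list of row dicts."""
    names = [col["name"] for col in response["manifest"]["columns"]]
    data = response["result"].get("data_array") or []
    return [dict(zip(names, row)) for row in data]


def search_with_overlap(
    search: Callable[[int], Dict[str, Any]],
    allowed_values: Iterable[str],
    wanted: int,
    column: str = "metadata_column",
    start: int = 50,
    limit: int = 800,  # hypothetical cap on how far to over-fetch
) -> List[Dict[str, Any]]:
    """Over-fetch until `wanted` rows overlap `allowed_values`, or give up."""
    allowed = set(allowed_values)
    num_results = start
    while True:
        rows = rows_from_response(search(num_results))
        matches = [r for r in rows if allowed.intersection(r.get(column) or [])]
        # Stop when there are enough matches, the search returned fewer rows
        # than requested (nothing more to fetch), or the cap has been reached.
        if len(matches) >= wanted or len(rows) < num_results or num_results >= limit:
            return matches[:wanted]
        num_results = min(num_results * 2, limit)


# Usage with a real index (not executed here):
# search = lambda n: index.similarity_search(
#     query_text="your_query", columns=["id", "metadata_column"], num_results=n)
# top_matches = search_with_overlap(search, ["value1", "value2"], wanted=10)
```

Passing the search as a callable keeps the retry logic independent of the client, so it can be tried against a stub before pointing it at a real index. Each retry repeats the full query, so the cap matters for latency.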
Thursday
The above solution is effectively a post-search filter, which would reduce the number of results returned. I am looking for a solution that performs the filtering on the index itself.
Thursday
Allow me to look further and see if there is any additional approach.
yesterday
Unfortunately, I was not able to find a way around that limitation beyond the solution proposed above.
yesterday
Thanks for your help...
yesterday
Sure, happy to help. Let us know if you have additional questions.