In the rapidly evolving and highly connected world of modern supply chains, organizations struggle with managing and consolidating vast amounts of data. One common issue that the supply chain sector faces is the challenge of mapping and consolidating taxonomies—structured categories of information—from multiple sources. These sources often come from various systems, each with its own naming conventions, formats, and noisy data. This results in a mismatch between terms and categories, which can slow down decision-making and increase operational costs.

The good news? Technologies like vector search and Large Language Models (LLMs) are reshaping how we handle these challenges, as we’ll explore in this blog:

  • The limitations of traditional methods like regex 
  • How vector search provides a more flexible and scalable solution for semantic matching
  • The role of LLMs in enhancing data understanding and dynamic taxonomy generation
  • Best practices for implementing vector search and LLMs in a supply chain context

By the end of this blog, you'll understand why vector search and large language models are viable and superior alternatives to regex, especially for large-scale, complex data environments.

Understanding the Data Problem: Noisy and Inconsistent Taxonomies

Consider the following data categories from a supply chain management system:

  • SOURCE_DIM: This includes data sources like Finance_System, Procurement_System, and Partner_Database.
  • REGION_NAME: These represent geographical regions such as Northwest, Southeast, East Coast, and Central London.
  • ROUTE_NAME: These describe distribution routes such as Route A, Southern Express, and Metro Supply Route.
  • DELIVERY_UNIT_NAME: This consists of delivery units such as Logistics Unit 1, Central Distribution Team, and West End Delivery.

In our example, one of the most challenging aspects is the delivery unit names, especially when comparing the "clean" version of the names (found in DELIVERY_UNIT_NAME_ORIGINAL) to the noisy, inconsistent versions (found in DELIVERY_UNIT_NAME_WITH_NOISE). The inconsistency makes it hard for automated systems to consolidate and map these names properly. 

For example:

  • Rural Network Logistics is mapped to a noisy version: RN:IMDM Rural Network Logistics.
  • West Coast Delivery becomes WC:IMDM Regional West Coast Delivery.

This is, of course, a simplified example for clarity; in practice, the data comes with many unpredictable variations and errors. In the rest of the blog, we will discuss the different ways we can solve this challenge.
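
To make this concrete, here is a minimal, hypothetical sample of what such a table might look like as a Spark DataFrame. The rows and column names below simply mirror the examples above and are illustrative only:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative rows only: clean names alongside their noisy counterparts
sample_rows = [
    ("Rural Network Logistics", "RN:IMDM Rural Network Logistics"),
    ("West Coast Delivery", "WC:IMDM Regional West Coast Delivery"),
]
df_sample = spark.createDataFrame(
    sample_rows,
    ["DELIVERY_UNIT_NAME_ORIGINAL", "DELIVERY_UNIT_NAME_WITH_NOISE"],
)
df_sample.show(truncate=False)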

The Role of Regular Expressions (Regex)

Regular expressions (regex) can be highly effective for cleaning and transforming noisy data into a standardized format. Regex patterns are rules that help identify and manipulate strings, allowing us to strip out unwanted prefixes, suffixes, or other noisy elements.

In our example, we might want to extract only the essential delivery unit names from DELIVERY_UNIT_NAME_WITH_NOISE and map them to the clean versions. A regex pattern like (London Central Greater Dispatch|GL:IMDM|IMDM London Central) can help us match the various noisy forms of a single unit back to its clean name, and similar patterns can be built for the other units (a small sketch follows the examples below).

For example:

  • "RN:IMDM Rural Network Logistics" would be matched to "Rural Network Logistics".
  • "WC:IMDM Regional West Coast Delivery" would be matched to "West Coast Delivery".

This type of pattern matching is a vital first step toward cleaning and mapping the data. However, regex alone quickly becomes a manual, hard-to-maintain process when the underlying data keeps changing: each time a new format or noise pattern appears (e.g., changed prefixes or new naming conventions), the patterns must be adjusted accordingly. This becomes difficult to scale as your data grows and evolves over time.

As supply chains and systems introduce new data sources or change their internal taxonomies, keeping regex patterns up to date requires continuous oversight. This is where tools like vector search and large language models (LLMs) come into play. They help reduce the need for manual intervention and enable smarter, more adaptive data mapping and taxonomy consolidation.

Using Vector Search for Taxonomy Consolidation in Supplier Networks

While regex is a powerful tool for pattern matching and data cleaning, it falls short when it comes to mapping entities based on semantic similarity. This is where vector search comes in. In vector search, data points are represented as vectors in a high-dimensional space, where semantically similar entities are located closer together. This allows for more flexible and intelligent matching that goes beyond simple string similarity.

For example, a traditional search system might struggle to match "Southwest Supply Group" and "SW:IMDM Group Southwest Regional Supply", but vector search would recognize these as semantically similar. Our approach uses vector search to tackle this problem by representing each supplier name as a high-dimensional vector, allowing for similarity-based matching across systems.
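
To build intuition, the sketch below embeds two name variants and compares them with cosine similarity. It assumes a pay-per-token embedding endpoint is available; the endpoint name (databricks-gte-large-en) and the response parsing are assumptions to adapt for your workspace:

import numpy as np
from mlflow.deployments import get_deploy_client

client = get_deploy_client("databricks")

names = ["Southwest Supply Group", "SW:IMDM Group Southwest Regional Supply"]

# Assumption: an OpenAI-compatible embeddings endpoint with this name exists in the workspace
response = client.predict(
    endpoint="databricks-gte-large-en",
    inputs={"input": names},
)
vec_a, vec_b = (np.array(item["embedding"]) for item in response["data"])

cosine = float(vec_a @ vec_b / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b)))
print(f"Cosine similarity: {cosine:.3f}")  # semantically similar names score close to 1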

Implementing Vector Search on Databricks: Step-by-Step Process

 

  • Initialize Vector Search Client and Ensure Endpoint is Ready

A VectorSearchClient is initialized to interact with Vector Search functionalities. This client manages endpoints and indexes for vector search operations.

from databricks.vector_search.client import VectorSearchClient

vsc = VectorSearchClient(disable_notice=True)

# endpoint_exists and wait_for_vs_endpoint_to_be_ready are helper functions
# assumed to be defined elsewhere (not shown in this post)
if not endpoint_exists(vsc, VECTOR_SEARCH_ENDPOINT_NAME):
    vsc.create_endpoint(name=VECTOR_SEARCH_ENDPOINT_NAME, endpoint_type="STANDARD")

wait_for_vs_endpoint_to_be_ready(vsc, VECTOR_SEARCH_ENDPOINT_NAME)
print(f"Endpoint named {VECTOR_SEARCH_ENDPOINT_NAME} is ready.")

  • Vector Index Creation

A Delta Sync vector index is created and kept in sync with the source Delta table, known as conformed_table_name (passed as source_table_fullname in the code below), which serves as the authoritative source for supplier identities. This index enables efficient similarity searches, making it easier to identify matching suppliers across various business units.

from databricks.sdk import WorkspaceClient
import databricks.sdk.service.catalog as c

# index_exists and wait_for_index_to_be_ready are helper functions
# assumed to be defined elsewhere (not shown in this post)
if not index_exists(vsc, VECTOR_SEARCH_ENDPOINT_NAME, vs_index_fullname):
    print(f"Creating index {vs_index_fullname} on endpoint {VECTOR_SEARCH_ENDPOINT_NAME}...")
    vsc.create_delta_sync_index(
        endpoint_name=VECTOR_SEARCH_ENDPOINT_NAME,
        index_name=vs_index_fullname,
        source_table_name=source_table_fullname,
        pipeline_type="TRIGGERED",
        primary_key=primary_key,
        embedding_source_column=embedding_source_column,
        embedding_model_endpoint_name=embedding_model_endpoint_name
    )
    wait_for_index_to_be_ready(vsc, VECTOR_SEARCH_ENDPOINT_NAME, vs_index_fullname)
else:
    wait_for_index_to_be_ready(vsc, VECTOR_SEARCH_ENDPOINT_NAME, vs_index_fullname)
    # Trigger a sync so the existing index picks up the latest rows from the source table
    vsc.get_index(VECTOR_SEARCH_ENDPOINT_NAME, vs_index_fullname).sync()

  • Similarity Search

The vector index is queried using a similarity search to find matching suppliers despite variations in their naming conventions. This approach captures subtle similarities that traditional keyword-based search methods might miss, such as abbreviations, name variations, or typographical errors.

query_text = "SW:IMDM Group Southwest Regional Supply"
region_name = "Southeast"
route_name = "Northern Loop"

results = vsc.get_index(VECTOR_SEARCH_ENDPOINT_NAME, vs_index_fullname).similarity_search(
    query_text=query_text,
    columns=['REGION_NAME', 'ROUTE_NAME', 'DELIVERY_UNIT_NAME'],
    filters={"REGION_NAME": region_name, "ROUTE_NAME": route_name},
    num_results=1)

# The third returned column (index 2) is DELIVERY_UNIT_NAME, i.e. the matched clean name
docs = results.get('result', {}).get('data_array', [])[0][2]
print(docs)

Vector Search on a batch of data

Using a User-Defined Function (UDF), we can integrate the similarity search functionality into Spark DataFrames. This enables the application of similarity searches directly within Spark processing pipelines, facilitating the integration of vector search capabilities into data transformation and analysis workflows.

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(StringType())
def similarity_search(query_text, region_name, route_name):
    results = vsc.get_index(VECTOR_SEARCH_ENDPOINT_NAME, vs_index_fullname).similarity_search(
        query_text=query_text,
        columns=['REGION_NAME', 'ROUTE_NAME', 'DELIVERY_UNIT_NAME'],
        filters={"REGION_NAME": region_name, "ROUTE_NAME": route_name},
        num_results=1)
    docs = results.get('result', {}).get('data_array', [])[0][2]
    return docs

df_with_similarity = df.withColumn(
    "similarity_results",
    similarity_search(df["DELIVERY_UNIT_NAME"], df["REGION_NAME"], df["ROUTE_NAME"])
)

It's important to note that the vector index, the core of vector search, also needs regular updating and syncing to remain effective. For more information on the different sync mode options that help keep your index up to date, see the Databricks documentation.

Using LLM for Taxonomy Consolidation in Supplier Networks

Large Language Models (LLMs) have the ability to understand the broader context of data, which makes them far more flexible and adaptive. For example, if a new, previously unseen delivery unit name appears in the system, an LLM can infer whether it belongs to an existing category by analyzing the meaning behind the words, rather than relying on exact matches or predefined rules. This allows for smarter, more intuitive mapping of data, especially when dealing with inconsistent or noisy inputs.

Moreover, LLMs excel at dynamic taxonomy generation, meaning they can adapt to changing data landscapes. When new delivery unit names or categories emerge, LLMs can automatically generate appropriate taxonomies or suggest merging similar categories, even when naming conventions vary significantly. This makes LLMs an essential tool for managing evolving data in real-time, improving both accuracy and efficiency.

In our case, to consolidate the delivery unit names, we used Llama 3.1 with the following prompt (an example of the call itself is sketched after the prompt):

"""

Imagine you are trying to consolidate the delivery unit names that can have many variations. Your task is to map the delivery unit {unit} to one of the following categories:

["Logistics Unit 1", "Supply Team A",  "Delivery Group North", "Central Distribution Team", "East Logistics Hub", "West End Delivery", "Urban Supply Group", "Rural Delivery Unit", ... ]

*Do not answer with None

*Must find the closest label to the input

Return your output in JSON format. Do not add extra text

"""

After running the LLM consolidation over the whole DataFrame, the results were:

  • RN:IMDM Rural Network Logistics → {"Delivery Unit": "Rural Network Logistics"}
  • EX:IMDM Delivery Express Unit → {"delivery_unit": "Delivery Express Unit"}
  • Express Route Distribution → None

This approach worked quite well, but as you can see, the dictionary keys are inconsistent, and the instructions in the prompt are not strictly followed everywhere, as some Nones are still returned. This inconsistency could cause issues with data cleaning and manipulation later on, which brings us nicely to the schema-enforcing capabilities of LLMs.

Schema enforcing capabilities of LLMs

LLMs, by nature, generate unstructured text. However, for many practical applications—especially in automated systems—you need structured data that conforms to a specific format. Enforcing a JSON schema within Large Language Models (LLMs) with function calling is a powerful approach to ensure structured, predictable outputs.

In this case, enforcing a JSON schema means guiding the LLM to produce consistent, predictable outputs that match a specific format, like {"unit": "Delivery Group North"}. This ensures that instead of generating free-form text or inconsistent key-value pairs, the model consistently produces the desired structure.

Here's what's happening:

  • Input variations: The LLM is given different delivery unit names, like "Coastal Logistics" or "Delivery Group North."
  • Schema enforcement: Rather than allowing the model to generate free-form text or inconsistent formats, we enforce a strict structure (in this case, a JSON format).
  • Consistent outputs: Regardless of the input, the output is formatted as {"unit": "<Delivery Unit Name>"}. This standardization prevents issues that might arise from inconsistent keys (like "Delivery Unit" vs. "unit") or missing fields, which can cause data processing errors.

The tool specification for our use case is presented below:

tools = [
    {
        "type": "function",
        "function": {
            "name": "_taxonomy_consolidation",
            "description": "Consolidate the taxonomy of delivery units",
            "parameters": {
                "type": "object",
                "properties": {
                    "unit": {
                        "type": "string",
                        "enum": ["Logistics Unit 1", "Supply Team A", "Delivery Group North",
                                 "Central Distribution Team", "East Logistics Hub", "West End Delivery",
                                 "Urban Supply Group", "Rural Delivery Unit", "Coastal Logistics", ...]
                    }
                },
                "required": ["unit"]
            }
        }
    }
]

This setup allows the tool to take a delivery unit name (as a string) and map it to one of the allowed categories. The enum restricts the unit value to a set of known categories, which prevents errors or inconsistent data. When called, this function returns a consolidated, validated version of the input, ensuring the taxonomy stays clean and standardized. The results can be seen below, followed by a sketch of how the call itself can be wired up:

  • GL:IMDM London Central Greater Dispatch → {"unit": "Greater London Central Dispatch"}
  • EX:IMDM Delivery Express Unit → {"unit": "Delivery Express Unit"}
  • WC:IMDM Regional West Coast Delivery → {"unit": "Regional West Coast Delivery"}

For the SQL-Heavy Users: You Can Do It All in SQL!

For users who prefer working directly with SQL on Databricks, the ai_query function is your friend. This SQL-native approach allows you to leverage the capabilities of large language models (LLMs) directly within your SQL queries, seamlessly integrating AI-powered data processing into your existing workflows. Below, you can see how to call ai_query to perform the whole process with the LLM and the schema enforcement, effectively replicating what we built in the previous section, just by using SQL. How cool is that!

CREATE OR REPLACE TABLE {CATALOG}.{SCHEMA}.tool_results AS (
  SELECT
    DELIVERY_UNIT_NAME,
    AI_QUERY(
      "databricks-meta-llama-3-3-70b-instruct",
      CONCAT("Imagine you are trying to consolidate the delivery unit names that can have many variations. Your task is to map the delivery unit {unit} to one of the following categories:
[
  Logistics Unit 1,
  Supply Team A,
  Delivery Group North,
  Central Distribution Team,
  East Logistics Hub,
  ……
]
*Do not answer with None
*Must find the closest label to the input
Return your output in JSON format. Do not add extra text", DELIVERY_UNIT_NAME),
      responseFormat => '{
        "type": "json_schema",
        "json_schema": {
          "name": "supply_unit_extraction",
          "schema": {
            "type": "object",
            "properties": {
              "unit": { "type": "string" }
            }
          },
          "strict": true
        }
      }',
      modelParameters => named_struct('max_tokens', 100, 'temperature', 0)
    ) AS response
  FROM {CATALOG}.{SCHEMA}.raw_supplier_data
)

Conclusion and Next Steps

While regex remains useful for initial data cleaning and simple pattern matching, it falls short in handling the dynamic and noisy data prevalent in modern supply chains. Vector search, by leveraging semantic similarity, excels at identifying subtle variations in entity names, providing greater flexibility and accuracy in matching.

LLMs, on the other hand, offer a deeper understanding of data and are capable of dynamically adapting to new or unseen categories without manual intervention. By enforcing JSON schemas and generating structured outputs, LLMs ensure that data remains clean, consistent, and ready for further analysis. This combination significantly reduces manual effort, increases operational efficiency, and allows organizations to maintain a cohesive taxonomy, ultimately leading to better decision-making and improved business outcomes.

What are your thoughts on this approach? Have you faced challenges in implementing these technologies? Join the conversation by sharing your experiences and insights in the comments section below!