Databricks Community

KyraWulffert · ‎10-24-2024

Detecting fraudulent purchases presents a significant challenge in industries with complex procurement processes, such as manufacturing, utilities, or public sector organisations. Vendors might engage in overcharging, false invoicing, or submitting duplicate claims, which can lead to substantial financial losses. Traditional methods often struggle to catch subtle anomalies in large transaction volumes and intricate supply chains. The issue is further complicated because fraud data is highly imbalanced, with fraudulent transactions making up only a small fraction of the data, and product data is typically high-dimensional, with distinguishing features that show only slight deviations, making detection even harder.

Machine Learning (ML) has proven to be a powerful tool in this area, but not every technique works well for every situation. In recent years, embeddings and Large Language Models (LLMs) have emerged as new methods that can help improve fraud detection accuracy.

In this blog, we explore how machine learning techniques, particularly those leveraging embeddings and LLMs, can improve fraud detection by identifying outliers and patterns that are otherwise difficult to spot. We'll begin with the traditional machine learning approach, move to an LLM-driven strategy, and conclude with a hybrid approach that combines the strengths of both.

Runnable notebooks with all the code sections are available in the Databricks blog’s GitHub repo.

Why embeddings?

Embeddings play a crucial role in fraud detection by transforming complex, high-dimensional data, such as natural-language product descriptions or purchase details, into dense vector representations. These embeddings capture the semantic relationships between items, allowing us to compare them meaningfully.

For example, products purchased from the same vendor might have similar embeddings, while fraudulent or anomalous purchases could have embeddings that stand out from the norm.

Figure 1. Visualisation of Product Embeddings in 2D Space: Normal products are represented in blue, while the outlier product is shown in red.

By applying machine learning algorithms on these embeddings, we can detect patterns and outliers, helping to identify potentially fraudulent transactions that deviate from expected behaviour.
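To make the idea of comparing embeddings concrete, here is a minimal sketch that measures cosine similarity between product vectors. The three-dimensional vectors and the product names are invented for illustration; a real embedding model would return vectors with hundreds of dimensions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, invented for illustration; not the output of a real embedding model.
toy_embeddings = {
    "office chair": [0.9, 0.1, 0.0],
    "standing desk": [0.8, 0.2, 0.1],
    "cloud software": [0.1, 0.2, 0.9],
}

similar = cosine_similarity(toy_embeddings["office chair"], toy_embeddings["standing desk"])
dissimilar = cosine_similarity(toy_embeddings["office chair"], toy_embeddings["cloud software"])
print(f"chair vs desk: {similar:.2f}")
print(f"chair vs cloud software: {dissimilar:.2f}")
```

In this toy setup the two furniture items score close to 1, while the software item scores much lower against them, which is the kind of gap an anomaly detector can pick up.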

ML for anomaly detection on embeddings

In our analysis of anomaly detection in embeddings, we explored several approaches, focusing primarily on dimensionality reduction, clustering algorithms, and traditional anomaly detection methods such as Isolation Forests.

Dimensionality reduction methods, like Principal Component Analysis (PCA) and t-distributed Stochastic Neighbour Embedding (t-SNE), help by reducing the number of dimensions in the data.

PCA effectively reveals global structures and is computationally efficient, but it may miss nuanced patterns in data.
t-SNE is better at showing detailed relationships but can be computationally intensive with large datasets.

Clustering algorithms group similar data points together, which helps in spotting outliers. For example, k-Means is easy to use and scales well to large datasets, but it assumes that clusters are roughly spherical. DBSCAN can find clusters of any shape and labels points in sparse regions as noise, but it struggles when clusters have different densities. Isolation Forest, a tree-based anomaly detector rather than a clustering algorithm, isolates points through random partitioning, which makes it suitable for high-dimensional data. However, in our analysis it performed the worst of the methods we tried.
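As a small illustration of how a density-based method surfaces outliers, the sketch below runs scikit-learn's `DBSCAN` on a handful of toy 2D points. The points stand in for embeddings and are invented; the `eps` and `min_samples` values are hand-picked for this toy data and are not the settings used in our analysis.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy 2D points standing in for embeddings (invented for illustration).
points = np.array([
    [0.00, 0.00], [0.10, 0.00], [0.00, 0.10],
    [0.10, 0.10], [0.05, 0.05],
    [5.00, 5.00],  # far from the dense group
])

# eps and min_samples are hand-picked for this toy data, not tuned values.
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(points)

# DBSCAN assigns the label -1 to points that belong to no cluster.
outlier_indices = np.where(labels == -1)[0].tolist()
print(outlier_indices)
```

On real embeddings, choosing `eps` is the hard part, because distances behave differently in high-dimensional spaces and vendors can have product ranges of very different density.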

Each method has its strengths and weaknesses, so the best choice depends on what you need for your specific data.

Applying ML for identifying anomalous purchases

In this blog, we analyse historical purchase data from a procurement department to detect anomalies in the products purchased from each vendor. To achieve this, we first need to ensure that we only consider vendors from which at least two unique products were purchased, as having a more extensive set of products provides a richer context for anomaly detection.
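The vendor filter described above could be sketched as follows. This is a minimal illustration in plain Python; the function name, the `(vendor, product)` record shape, and the sample records are assumptions made for the example, not the schema used in the notebooks.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def vendors_with_enough_products(
    purchases: Iterable[Tuple[str, str]], min_unique_products: int = 2
) -> Dict[str, Set[str]]:
    """Group (vendor, product) pairs and keep vendors with enough distinct products."""
    products_by_vendor: Dict[str, Set[str]] = defaultdict(set)
    for vendor, product in purchases:
        products_by_vendor[vendor].add(product)
    return {
        vendor: products
        for vendor, products in products_by_vendor.items()
        if len(products) >= min_unique_products
    }


# Hypothetical purchase records, invented for illustration.
purchases = [
    ("Vendor A", "flour"),
    ("Vendor A", "sugar"),
    ("Vendor A", "flour"),  # repeat purchase, counted once
    ("Vendor B", "printer ink"),
    ("Vendor B", "printer ink"),
]
eligible = vendors_with_enough_products(purchases)
print(sorted(eligible))
```

Vendor B is dropped here because it only ever sold one distinct product, so there is no product range to compare against.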

We start by embedding these products into a high-dimensional space using an embedding model (e.g. gte-large). Once embedded, we apply an anomaly detection algorithm to identify products that significantly deviate from the norm. This approach helps us pinpoint unusual purchases that may indicate potential fraud or irregularities within the vendor's transaction history.

Figure 2. Pipeline for anomaly detection on purchase data using ML

We will now demonstrate how to use embeddings for anomaly detection in purchase data. To simplify and visualise the process, we will apply PCA.

PCA is a method that simplifies high-dimensional data by reducing its dimensions while preserving key information. PCA achieves this by transforming the data into a new set of axes called principal components. These components are the directions in which the data varies the most. The first principal component captures the highest variance, the second captures the next highest variance, and this continues with each subsequent component. We can then measure the reconstruction error by reconstructing the data from these principal components and comparing it to the original embeddings. High reconstruction errors indicate that some embeddings may differ significantly from the majority, thus highlighting potential anomalies.
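The reconstruction-error idea can be sketched with NumPy alone. The function below is an illustrative implementation via singular value decomposition, not necessarily the one used in the notebooks, and the toy 3D points are invented: six points lie on a line and one sits off it.

```python
import numpy as np


def pca_reconstruction_errors(embeddings, n_components: int = 2) -> np.ndarray:
    """Project onto the top principal components, reconstruct, and return per-row error."""
    X = np.asarray(embeddings, dtype=float)
    mean = X.mean(axis=0)
    X_centered = X - mean
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    projected = X_centered @ components.T
    reconstructed = projected @ components + mean
    # Euclidean distance between each original row and its reconstruction.
    return np.linalg.norm(X - reconstructed, axis=1)


# Toy data, invented for illustration: six points on a line plus one point off it.
t_values = [-3, -2, -1, 1, 2, 3]
toy = [[t, t, 0.0] for t in t_values] + [[0.0, 0.0, 2.0]]
errors = pca_reconstruction_errors(toy, n_components=1)
print(int(np.argmax(errors)))  # index of the row with the largest error
```

With a single component, the points on the line are reconstructed almost exactly, while the off-line point cannot be represented and ends up with the largest error.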

In the example below, we visualise the two main PCA components of the embeddings of products purchased from a vendor. The colour gradient shows the magnitude of the reconstruction error; the higher, the worse. For the particular set of products purchased from this vendor, `Cloud Software` has a higher reconstruction error and is considered an anomaly based on the error threshold we set (99th percentile).

Figure 3. PCA of Product Embedding with Reconstruction Error for products purchased from two different vendors. Left: `Cloud Software` is correctly identified as anomalous. Right: `Bakery Tools` and `Delivery Fees` are identified as anomalous.

In the process of identifying anomalies using PCA, setting the threshold for reconstruction error is crucial and can significantly impact the results. This threshold determines which reconstruction errors are considered anomalous. A common approach is to choose a high percentile of the reconstruction errors, such as the 99th percentile, to define anomalies. However, this threshold is highly sensitive, and its selection requires careful tuning. Setting it too low might result in too many false positives, identifying normal products as anomalies, while setting it too high might miss actual anomalies. Other methods to set this threshold include using domain-specific knowledge to set a practical value or employing cross-validation techniques to determine the most effective threshold for your specific dataset. Fine-tuning this threshold is essential to balance detecting true anomalies and minimising false positives.
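A percentile-based threshold of this kind might look like the sketch below. The helper name and the error values are hypothetical; `np.percentile` is used with its default linear interpolation.

```python
import numpy as np


def flag_by_percentile(errors, percentile: float = 99.0):
    """Return the threshold and a boolean mask of errors strictly above it."""
    errors = np.asarray(errors, dtype=float)
    threshold = np.percentile(errors, percentile)
    return threshold, errors > threshold


# Hypothetical reconstruction errors for five products from one vendor.
errors = [0.10, 0.20, 0.15, 0.12, 2.00]
threshold, is_anomaly = flag_by_percentile(errors)
print(is_anomaly.tolist())
```

One property worth keeping in mind: when the threshold is a percentile of the same errors being scored and the largest error is unique, that largest error always exceeds a 99th-percentile threshold, so at least one item is flagged in every group. This is one reason a threshold derived from domain knowledge or a validation set can be preferable.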

Limitations of the ML approach

While PCA is a powerful anomaly detection tool, it comes with limitations. One significant drawback is that it may flag items as anomalies based on their reconstruction error without considering contextual factors. For example, suppose a vendor's product list includes `delivery charges` alongside other products. In that case, PCA might identify `delivery charges` as anomalous, even though they are a standard part of the vendor’s offerings. This potential false positive happens because PCA primarily focuses on the numerical patterns in the data without understanding the practical context or business logic.

Furthermore, traditional ML algorithms for anomaly detection, including PCA, often lack interpretability and provide limited explanations beyond simply indicating that a threshold has been surpassed. They don’t offer clear reasons why a specific product is flagged as anomalous, making it difficult to understand and validate the results, particularly in cases where the detected anomalies might actually be normal, context-specific variations. Here is where Generative AI (GenAI) can make a difference, as it offers advanced contextual understanding and the ability to provide more insightful explanations for detected anomalies.

GenAI for anomaly detection

Large Language Models (LLMs) provide a sophisticated approach to anomaly detection by leveraging advanced text processing capabilities.

Unlike traditional methods that may overlook contextual nuances, LLMs can analyse product descriptions and other textual data with a deeper understanding of each vendor’s typical offerings. This capability allows LLMs to assess whether a flagged product is genuinely anomalous or if it aligns with the vendor’s usual product range.

Additionally, LLMs can offer explanations for why a product is considered anomalous, enhancing the clarity and interpretability of the results. This approach enables a more in-depth analysis of complex data and generates more actionable insights.

Although the PCA model detects anomalous purchases, some of the purchases it flags are not actually anomalous once context or a set of instructions is taken into account. The solution can be improved by instructing an LLM to evaluate the vendors that the PCA model scored as having anomalous transactions, and to give a reason for its verdict on whether the products purchased from those vendors are anomalous or not.

Applying GenAI to identify anomalous purchases

One of the most straightforward approaches to using LLMs for this use case is employing few-shot learning through tailored prompts. This method involves providing the model with a few examples of what constitutes an anomaly and what does not, guiding the LLM to recognise patterns and outliers in new data. By using specific prompts that illustrate different scenarios, the LLM can leverage its pretrained knowledge to make informed decisions about whether a product or behaviour deviates significantly from the norm. This approach is relatively simple to implement and can quickly yield valuable insights into potential anomalies based on historical data and contextual examples provided in the prompt.

Example prompt for a list of products purchased from a vendor:

OUTLIER_PROMPT_TEMPLATE = """Imagine you are analyzing a list of products offered by a fictional company. The company specializes in certain types of products, but the specifics are not given. Use reasonable assumptions about what types of products a company with a given focus might offer. Your task is to identify any outliers in the list—products that do not fit well with what the company would typically provide. If the product list mainly suggests a certain type of product (like furniture or fruit), identify products that significantly differ from the rest. If all products could reasonably fit into a general theme or focus area, do not list any outliers. 
*DO NOT consider as outliers any transportation, delivery, courier or freight services or charges as they are usually related to the delivery of parts a company produces.*
Your output should be in JSON format.

Examples:
- Products: ["apple", "kiwi", "strawberry", "bread", "COURIER SERVICES"], Output: {{"outliers": ["bread"], "reason": "bread is not a type of fruit, which is what the company seems to specialize in."}}
- Products: ["chair", "table", "sofa", "bed", "refrigerator", "tree"], Output: {{"outliers": ["refrigerator", "tree"], "reason": "refrigerator and tree are not typical furniture items, which appears to be the company's focus."}}
- Products: ["blue", "red", "yellow", "green", "freight services"], Output: {{"outliers": [], "reason": "All items are colors, and could plausibly be provided by the company."}}

# Products
{products}

"""

Example output for the list of products above:

The example shows that the LLM provides valuable insights and explanations for identifying anomalous product purchases. However, the results can sometimes be verbose and unstructured and may not always adhere to the specified JSON format in the prompt. This lack of structure can make it challenging to directly integrate the findings into automated systems or dashboards. Here is where the output quality can be improved by using the tools API. It is possible to provide a strict JSON schema for the output, and the Foundation Model API inference service ensures that the model's output either adheres to this schema or returns an error if this is not possible.
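To illustrate why free-form output is awkward to integrate, here is a minimal sketch of a defensive parser that searches a verbose reply for the first JSON object containing the `outliers` and `reason` keys. The function name and the sample reply are hypothetical; this is a workaround for unstructured output, not part of the tools API.

```python
import json
from typing import Optional


def extract_outlier_json(response_text: str) -> Optional[dict]:
    """Return the first JSON object with 'outliers' and 'reason' keys, or None."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(response_text):
        if char != "{":
            continue
        try:
            # raw_decode parses one JSON value and ignores any trailing text.
            obj, _ = decoder.raw_decode(response_text[start:])
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and "outliers" in obj and "reason" in obj:
            return obj
    return None


# Hypothetical verbose reply, written by hand for illustration.
reply = (
    "Sure! Here is my analysis:\n"
    '{"outliers": ["bread"], "reason": "bread is not a fruit"}\n'
    "Let me know if you need anything else."
)
parsed_reply = extract_outlier_json(reply)
print(parsed_reply)
```

A parser like this still returns `None` whenever the model skips the JSON altogether, which is the gap that schema enforcement closes.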

Example using function calling tools to enforce the JSON format in the output:

tools = [
   {
       "type": "function",
       "function": {
           "name": "_outlier_detection",
           "description": "Identifies outliers in the list of products",
           "parameters": {
               "type": "object",
               "properties": {
                   "outliers": {
                       "type": "array",
                       "items": {"type": "string"},
                   },
                   "reason": {
                       "type": "string",
                       "description": "Reason for the item to be identified as an anomaly"
                   },                   
               },
               "required": [
                   "outliers",
                   "reason"
               ],
           },
       },
   },
]




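from typing import List


def prompt_with_outlier_tool(products: List[str]):
   """Ask the LLM for outliers, constraining the reply to the `_outlier_detection` tool schema."""
   # Convert the list of products to a string format suitable for the LLM
   products_str = "\n".join(products)
   prompt = OUTLIER_PROMPT_TEMPLATE.format(products=products_str)
   # call_chat_model is a chat-completion helper defined elsewhere (not shown in this post)
   return call_chat_model(prompt, tools=tools)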
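Even with a schema in place, it can be useful to sanity-check the arguments before passing them downstream. The sketch below assumes the tool call's arguments arrive as a JSON string, as they do in OpenAI-compatible chat responses; the exact shape returned by `call_chat_model` is not shown in this post, so treat that as an assumption. The function name and payload are hypothetical.

```python
import json
from typing import Any, Dict


def parse_outlier_arguments(arguments_json: str) -> Dict[str, Any]:
    """Parse and sanity-check arguments for the `_outlier_detection` tool."""
    args = json.loads(arguments_json)
    if not isinstance(args, dict):
        raise ValueError("tool arguments must be a JSON object")
    outliers = args.get("outliers")
    reason = args.get("reason")
    if not isinstance(outliers, list) or not all(isinstance(o, str) for o in outliers):
        raise ValueError("'outliers' must be a list of strings")
    if not isinstance(reason, str):
        raise ValueError("'reason' must be a string")
    return args


# Hypothetical arguments payload, written by hand for illustration.
example_arguments = '{"outliers": ["bread"], "reason": "bread is not a fruit"}'
parsed = parse_outlier_arguments(example_arguments)
print(parsed["outliers"])
```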

Using the function calling tools, we now have a consistently structured output:

Hybrid approach

A hybrid approach that combines traditional anomaly detection models with LLMs can offer a more comprehensive solution for identifying anomalies.

In this architecture, embeddings first pass through a traditional anomaly detection model, such as PCA or clustering algorithms, to detect potential outliers based on statistical properties and patterns. This step provides a preliminary filter for identifying anomalies. Subsequently, the results are analysed by an LLM, which can offer deeper contextual understanding and explanations. The LLM can clarify why a particular product is flagged anomalous, addressing the limitations of traditional models, which may lack interpretability. This two-step process leverages the strengths of both approaches: the statistical rigour of traditional models and the detailed, context-aware insights from LLMs.
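The two-step flow could be wired together roughly as follows. This is a sketch under stated assumptions: `hybrid_review` and its data shapes are hypothetical, and `fake_llm_review` is a hard-coded stand-in for a real LLM call so that the example runs on its own.

```python
from typing import Callable, Dict, List

LLMReview = Callable[[List[str]], Dict[str, object]]


def hybrid_review(
    products_by_vendor: Dict[str, List[str]],
    ml_flags: Dict[str, List[str]],
    llm_review: LLMReview,
) -> Dict[str, Dict[str, object]]:
    """Send only ML-flagged vendors to the LLM and keep flags the LLM confirms."""
    results = {}
    for vendor, flagged in ml_flags.items():
        if not flagged:
            continue  # nothing flagged by the ML stage, so no LLM call
        review = llm_review(products_by_vendor[vendor])
        llm_outliers = set(review["outliers"])
        results[vendor] = {
            "confirmed": [p for p in flagged if p in llm_outliers],
            "dismissed": [p for p in flagged if p not in llm_outliers],
            "reason": review["reason"],
        }
    return results


def fake_llm_review(products: List[str]) -> Dict[str, object]:
    """Stand-in for a real LLM call: treats only 'cloud software' as an outlier."""
    outliers = [p for p in products if p == "cloud software"]
    return {"outliers": outliers, "reason": "stubbed response for illustration"}


# Hypothetical vendors, products, and ML flags, invented for illustration.
products_by_vendor = {
    "Bakery Co": ["flour", "sugar", "delivery fees"],
    "Tool Co": ["hammer", "drill", "cloud software"],
    "Paper Co": ["a4 paper", "envelopes"],
}
ml_flags = {
    "Bakery Co": ["delivery fees"],
    "Tool Co": ["cloud software"],
    "Paper Co": [],
}
results = hybrid_review(products_by_vendor, ml_flags, fake_llm_review)
print(results)
```

In this toy run the LLM stage is called for two of the three vendors, the delivery-fee flag is dismissed as a false positive, and the software flag is confirmed.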

Figure 4. Anomaly Detection Hybrid Solution Pipeline

In our example with purchase data, this hybrid approach notably reduced false positives, giving more accurate and actionable results. For example, vendors that the ML model flagged because of delivery and transportation fees could be identified as false positives by the LLM.

Figure 5. The LLM detects False Positive anomalies given by the traditional ML anomaly detector

The solution has been evaluated using an evaluation dataset and human feedback. For the hybrid approach, the evaluation can be augmented by using the Mosaic AI Agent Framework with out-of-the-box metrics on the LLM output (like toxicity, token count, and latency) or by adding an LLM as a judge for custom or extended metrics.

Why use a hybrid approach instead of just using LLMs?

While LLMs alone can offer strong anomaly detection and provide rich context, they can be computationally expensive, especially when dealing with large volumes of procurement purchases. The hybrid approach ensures that a more efficient model (like a traditional ML anomaly detection model) handles the bulk of anomaly detection, while LLMs are used for adding meaningful context and explanations for edge cases. This balance helps maintain accuracy and interpretability while keeping costs lower and making the solution more scalable.

Conclusion and next steps

In this blog, we demonstrated how traditional machine learning models and Generative AI (GenAI) can complement each other's strengths. By combining traditional anomaly detection ML models with the interpretative power of LLMs, businesses can streamline the identification of anomalous purchases. This approach can significantly reduce the time it takes to flag potentially fraudulent transactions, highlighting just one of many potential applications in the industry.

However, it's important to consider the return on investment (ROI) when integrating LLMs into anomaly detection systems, as not all applications may yield significant benefits. Scenarios where reducing false positives is important, and where providing clear explanations for flagged transactions is crucial, can benefit greatly from this integration.

As for the next steps and improvements, one option could involve incorporating additional context from vendor contracts or related documents to enhance the decision-making process.

Another option could be to fine-tune a smaller LLM on a dataset of proven fraudulent purchases to enhance the model’s ability to detect anomalies. However, this requires a high-quality dataset with thousands of examples, which may be difficult to obtain without employing synthetic data generation techniques.

What are your thoughts on this hybrid approach? Have you faced challenges in implementing these technologies? Join the conversation by sharing your experiences and insights in the comments section below!


Anomaly detection using embeddings and GenAI

