maria_zervou

Introduction

Welcome to our technical blog series on the challenges encountered when building and deploying Retrieval-Augmented Generation (RAG) applications. RAG is a GenAI technique used to incorporate relevant data as context to a large language model (LLM) without the need to further train the LLM (known as “fine-tuning”). 

While RAG applications can transform vast amounts of raw data into insightful, context-rich answers, developers often face several hurdles that can impact the performance, security, and reliability of these applications.

In this series, we aim to unpack common challenges that you might encounter when implementing RAG solutions.

In this blog, we will explore the first three challenges. Part 2 of this series will detail challenges 4 and 5. 

By the end of this series, you should have a comprehensive understanding of the potential setbacks in RAG application development and practical strategies to overcome them, ensuring your applications are not only functional but also secure and efficient. 

Quick Primer - RAG Application Architecture

Before delving into specific challenges associated with building RAG applications, it's essential to understand the underlying architecture and how it operates.

A RAG application is a complex system that integrates many components to achieve production-level quality outputs. RAG shines in use cases where the relevant information is continuously evolving (e.g. customer order data), or where the model encounters unfamiliar data outside its training set. Unlike traditional prompt engineering, RAG can adapt to provide up-to-date and contextually correct outputs without the need to pre-train or fine-tune the model.

RAG operates by scanning a large corpus of documents to identify content relevant to a specific query. This is achieved through a vector database where queries and documents are converted into high-dimensional vectors. The similarity between these vectors determines the relevance of documents to the query.

When a user raises a query, the system retrieves the pertinent data from the vector database to provide context to the LLM. This relevant data is then used to augment the LLM prompt, ensuring the generated responses are tailored and accurate.

What can go wrong with RAG applications

Building a RAG application involves several critical steps:

  1. Embedding the source documents and the query.
  2. Conducting a similarity search to identify and retrieve relevant content.
  3. Integrating the query and the retrieved data to formulate the final prompt for response generation.

[Figure: RAG application architecture]

Users might encounter challenges anywhere in this process, so each step - including the operational aspects - requires careful consideration.
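The three steps above can be sketched end-to-end in a few lines. This is a toy illustration under stated assumptions: `embed` here is a simple bag-of-words count vector standing in for a real embedding model, and the prompt is assembled directly rather than sent to an LLM endpoint.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Steps 1-2: embed the query and rank documents by similarity."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(query: str, context: list[str]) -> str:
    """Step 3: augment the prompt with the retrieved context."""
    ctx = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{ctx}\n\nQuestion: {query}"

docs = [
    "Orders ship within two business days.",
    "Returns are accepted within 30 days of delivery.",
    "Our headquarters are located in Amsterdam.",
]
prompt = build_prompt("When do orders ship?", retrieve("When do orders ship?", docs))
```

In a production RAG app, the embedding calls go to a real model, the similarity search runs against a vector index, and the prompt is sent to the LLM; the shape of the pipeline stays the same.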

Step 0 - Establish the initial benchmark

Before troubleshooting and iterating on the RAG application by optimizing each component, it’s crucial to start with a benchmark to compare against - as is usually done for machine learning projects. Without a benchmark, you will not be able to evaluate if the solution you have built is performing well and if it is demonstrating business impact to your stakeholders. 

In order to build a benchmark for RAG, you should first deploy RAG components in the most basic setup, without tuning configurations. Once this benchmark is in place, you can iterate and monitor the quality of responses.
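A minimal benchmark harness can be as simple as a fixed evaluation set scored with a crude metric. The sketch below is illustrative: `keyword_recall` and the `baseline_app` stub are assumptions for the example (in practice you would use an LLM judge or human review), but the pattern of "freeze an eval set, score the baseline, compare every iteration against it" carries over.

```python
def keyword_recall(answer: str, expected_keywords: list[str]) -> float:
    """Crude quality metric: fraction of expected keywords present in the answer."""
    hits = sum(1 for kw in expected_keywords if kw.lower() in answer.lower())
    return hits / len(expected_keywords)

def run_benchmark(rag_app, eval_set: list[dict]) -> float:
    """Score a RAG app (a callable question -> answer) over a fixed eval set."""
    scores = [keyword_recall(rag_app(case["question"]), case["keywords"])
              for case in eval_set]
    return sum(scores) / len(scores)

eval_set = [
    {"question": "What is the return window?", "keywords": ["30 days"]},
    {"question": "Where do you ship from?", "keywords": ["Amsterdam"]},
]

def baseline_app(question: str) -> str:
    """Stub for the untuned, most basic RAG setup."""
    return "Returns are accepted within 30 days."

# Every later iteration (new chunking, new embedding model, ...) is scored
# with run_benchmark on the SAME eval_set and compared to this number.
baseline_score = run_benchmark(baseline_app, eval_set)
```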

Challenge 1 - “My RAG app is slow at generating an answer” 

Latency can be experienced at different steps of the RAG application. To help resolve this, you can optimize in the following places:

  • Consider tradeoffs between token limit and performance
  • Select a lower-dimensional embedding model
  • Select a more optimized algorithm for your Vector Search Index
  • Select the most appropriate base LLM model

Consider tradeoffs between token limit and performance

It’s important to note that there is a tradeoff between a higher token limit and performance: a model with a higher token limit can process more information at once, but that information takes longer to process.

You can select a model with a smaller token limit to improve performance. In order not to exceed the token limit, the documents can be split into smaller, meaningful parts. If splitting the document into meaningful small chunks is challenging, passing summaries can be a good alternative. Different chunking strategies can be adopted depending on the data at hand. It is also preferable to use shorter chunks when the RAG app answers short user queries. If you don't want to select a model with a smaller token limit, another strategy could be to reduce the number of documents retrieved when hitting the token limit.
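A fixed-size splitter with overlap is the simplest of the chunking strategies mentioned above. The sketch below counts words as a rough proxy for tokens; a real implementation would size chunks with the embedding model's own tokenizer.

```python
def chunk_text(text: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    """Split text into overlapping, fixed-size chunks.
    Words are a rough proxy for tokens; use the model's tokenizer in practice.
    The overlap keeps context that straddles a chunk boundary retrievable."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

doc = " ".join(f"word{i}" for i in range(250))
chunks = chunk_text(doc, chunk_size=100, overlap=20)
```

Smaller `chunk_size` values keep prompts short (helpful for short user queries and tight token limits); larger values preserve more context per chunk.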

The underlying technique of the model also has an impact on how tokens are managed. For example, the Mixture of Experts (MoE) approach used by DBRX, Databricks’ open LLM, helps manage token counts while maintaining performance. Because MoE models activate only a fraction of their parameters for each input, they are more efficient at inference than their total parameter counts would suggest. For instance, DBRX can generate text at up to 150 tokens per second per user, which is significantly faster than many dense models.

The MoE approach also helps to keep the token counts reasonable. DBRX uses the GPT-4 tokenizer, which is known to be especially token-efficient. This means that fewer tokens are necessary to reach the same model quality, which helps to keep the token counts manageable.

Select a lower-dimensional embedding model

An embedding model transforms both the input source text and the user query into vectors. When choosing an embedding model, selecting one with a larger number of dimensions isn't always the optimal choice.

As the dimensionality increases, more context is incorporated into the embedding, making it more expressive; however, complexity increases as well. The more complex the model, the slower and heavier the processing (which drives up costs!).

It's worth noting that the models with the most dimensions aren't necessarily the ones leading the MTEB leaderboard. When users search for a model on Hugging Face, it’s possible to filter by task to choose embedding models that are best for embedding similarity, as shown below. 

[Figure: Hugging Face model results filtered for Sentence Similarity]

 

Important!

It’s crucial to use the same embedding model for both the user's query and the source documents. Even though the embedding of source documents is a separate pre-processing action from the user question's embedding, the similarity search will not work unless the same embedding model is applied. In other words, the space where the documents are located and the space where the question is positioned must be identical.
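One way to see why this matters is to imagine two hypothetical embedding models that disagree on dimensionality: vectors from different models live in different spaces, so comparing them is at best an error and at worst silently meaningless. Both models below are made up for illustration.

```python
def model_a(text: str) -> list[float]:
    """Hypothetical 3-dim embedding model."""
    t = text.lower()
    return [float(t.count("order")), float(t.count("return")), float(len(t))]

def model_b(text: str) -> list[float]:
    """Hypothetical 2-dim embedding model: a different, incompatible space."""
    t = text.lower()
    return [float(len(t.split())), float(t.count("e"))]

def dot(u: list[float], v: list[float]) -> float:
    if len(u) != len(v):
        raise ValueError("vectors live in different embedding spaces")
    return sum(a * b for a, b in zip(u, v))

doc_vec = model_a("Returns are accepted within 30 days")  # documents pre-embedded with model_a
ok = dot(model_a("return policy"), doc_vec)               # same space: comparable
try:
    dot(model_b("return policy"), doc_vec)                # mixed spaces: fails
    mixed_ok = True
except ValueError:
    mixed_ok = False
```

Even when two models happen to share a dimension count, their axes mean different things, so the similarity scores would still be meaningless; matching dimensions is necessary but not sufficient.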


Select a more optimized algorithm for your Vector Index

Applications, especially those with real-time requirements, need query results to be returned as quickly as possible. However, if the vector database is large and not optimally indexed, the retriever may not be able to return relevant documents quickly enough.

Optimizing the data structure of the index for quick traversal can be achieved by using more advanced algorithms. Databricks Vector Search uses the Hierarchical Navigable Small World (HNSW) algorithm for its approximate nearest neighbour searches. This algorithm is used to identify the most similar vectors and return the associated documents.

Additionally, there are performance considerations when performing similarity searches. By default in Databricks, the similarity between vectors is measured using the L2 distance metric. If users want to use cosine similarity, they will need to normalize their datapoint embeddings before feeding them into Databricks Vector Search. When the data points are normalized, the ranking produced by L2 distance is the same as the ranking produced by cosine similarity.
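This equivalence is easy to verify: for unit vectors, ||u - v||² = 2 - 2·cos(u, v), so ascending L2 distance and descending cosine similarity produce the same ranking. The values below are arbitrary example vectors.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def l2(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cos(u, v):
    return sum(a * b for a, b in zip(u, v)) / (
        math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Normalize the document embeddings (and the query) before indexing.
query = normalize([0.2, 0.9, 0.4])
docs = [normalize(v) for v in ([1.0, 0.1, 0.0], [0.1, 1.0, 0.3], [0.4, 0.4, 0.9])]

rank_by_l2 = sorted(range(3), key=lambda i: l2(query, docs[i]))                   # ascending distance
rank_by_cos = sorted(range(3), key=lambda i: cos(query, docs[i]), reverse=True)   # descending similarity
```

With normalized embeddings the two rankings coincide, which is why normalizing data points before feeding them into an L2-based index effectively gives you cosine-similarity search.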

Select the most appropriate base LLM model

The choice of the base LLM is crucial, along with the choice of embedding model. Models vary across properties such as reasoning ability, propensity to hallucinate, context window size and serving cost. All the considerations highlighted in assessing the embedding model section above are also applicable when choosing LLMs. 

Similar to the embedding model, the larger the LLM, the heavier it is to serve (and, by extension, the higher the application latency). Choosing the best LLM means weighing model size against the task at hand.

Ultimately, the LLM should be chosen by considering the tradeoffs between model size and latency, together with the type of task that the RAG app is performing (e.g. Question Answering, Sentiment Analysis), as shown in the example below.

Finally, depending on the application being built, the more tokens included in the prompt, the longer the LLM will take to generate the answer.

[Figure: Hugging Face model results filtered for Question Answering]

 

Challenge 2 - “I cannot update and delete data from the vector store”

Vector libraries and databases are both solutions specialized in storing vectorized data, with integrated search capabilities. However, selecting the right store that meets your requirements and data needs is imperative.

Select the right vector store  

With vector libraries, the vector index must be rebuilt over all documents whenever any document is modified or deleted. A vector library is therefore only recommended when the data doesn’t change often.

On the other hand, vector databases are useful when users need database properties, have big data, and when changes in the index are likely to occur often. Vector databases allow users to update and delete data during the import process. 
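The difference can be illustrated with a toy in-memory store keyed by document id: updates and deletes touch only the affected entry, with no full index rebuild. The class and method names below are illustrative, not any vendor's API.

```python
class ToyVectorStore:
    """Toy in-memory vector store supporting upsert and delete by document id.
    Illustrative only -- a real vector database does this at scale with indexes."""

    def __init__(self):
        self._vectors = {}   # doc_id -> embedding
        self._payloads = {}  # doc_id -> original text

    def upsert(self, doc_id, vector, text):
        """Insert or update a single document without touching the others."""
        self._vectors[doc_id] = vector
        self._payloads[doc_id] = text

    def delete(self, doc_id):
        """Remove a single document without rebuilding the index."""
        self._vectors.pop(doc_id, None)
        self._payloads.pop(doc_id, None)

    def search(self, query_vec, k=1):
        def dot(u, v):
            return sum(a * b for a, b in zip(u, v))
        ranked = sorted(self._vectors,
                        key=lambda d: dot(query_vec, self._vectors[d]),
                        reverse=True)
        return [self._payloads[d] for d in ranked[:k]]

store = ToyVectorStore()
store.upsert("doc1", [1.0, 0.0], "Old shipping policy")
store.upsert("doc1", [1.0, 0.0], "New shipping policy")  # in-place update
store.upsert("doc2", [0.0, 1.0], "Returns policy")
store.delete("doc2")                                      # targeted delete
```

With a vector library, the equivalent of `upsert` or `delete` would force re-embedding and re-indexing the entire corpus.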

Databricks natively supports serving and indexing data for online retrieval. For unstructured data such as PDF documents, Vector Search (Databricks managed vector database) will automatically index and serve data from Delta tables (using the Delta Sync Index) and for structured data, Feature and Function Serving provides millisecond-scale queries of contextual data. 

Another option that can be considered when building RAG applications is leveraging your existing relational databases or search systems that offer vector search plugins. However, ensure they support all the functionalities needed.

Challenge 3 - “The app retrieves sensitive data that users should not have access to” 

It's absolutely pivotal for RAG applications to restrict access to sensitive data. These systems combine automated content generation with data retrieval from multiple sources, raising the risk of unauthorized data exposure. 

This is particularly important when building customer-facing RAG apps where it’s crucial to ensure that no sensitive data is inadvertently revealed to customers. Moreover, safeguarding sensitive data maintains the technology's credibility and trustworthiness, crucial for its broad acceptance and usage.

Select Unity Catalog for your governance and security framework   

Databricks natively supports serving and indexing data for online retrieval by enforcing access control settings between online and offline datasets through Unity Catalog. With UC integration, RAG workflows can be designed to retrieve only the documents that a user has access to. 

Unity Catalog also supports lineage-tracking allowing users to trace the flow of data through the RAG process for auditing purposes.
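At a conceptual level, the pattern is to filter documents by the requesting user's permissions *before* they are ranked and passed to the LLM. The sketch below mimics that behaviour with a per-document access-control list; it is an illustrative assumption, not Unity Catalog's actual API (Unity Catalog enforces this through its own access control settings).

```python
docs = [
    {"id": "d1", "text": "Public product FAQ", "allowed_groups": {"everyone"}},
    {"id": "d2", "text": "Internal pricing sheet", "allowed_groups": {"sales"}},
    {"id": "d3", "text": "HR salary bands", "allowed_groups": {"hr"}},
]

def retrieve_for_user(query: str, user_groups: set, corpus: list) -> list:
    """Drop documents the user cannot access BEFORE any similarity ranking,
    so sensitive text never reaches the prompt."""
    effective = user_groups | {"everyone"}
    visible = [d for d in corpus if d["allowed_groups"] & effective]
    # (similarity ranking over `visible` would happen here)
    return [d["text"] for d in visible]

customer_view = retrieve_for_user("pricing", {"customers"}, docs)
sales_view = retrieve_for_user("pricing", {"sales"}, docs)
```

Filtering after generation is not a safe substitute: once a restricted document has been injected into the prompt, its contents can leak into the answer.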

Conclusion

In Part 1 of this blog post, we explored three challenges that can occur when building RAG applications and some strategies to address these challenges:

Challenge 1 - “My RAG app is slow at generating an answer”: Many factors impact performance, including token limit tradeoffs, the choice of embedding model, the similarity metric, and the choice of LLM.

Challenge 2 - “I cannot update and delete data from the vector store”: If the solution doesn’t rely on a scalable vector store, users can face the challenge where data cannot be updated or deleted in their vector store.

Challenge 3 - “The app retrieves sensitive data that users should not have access to”: It’s important to safeguard the application with the right governance framework and security guardrails so that it doesn’t retrieve documents users should not have access to.

Coming up next

Stay tuned for Part 2 of this series where we’ll explore two other common challenges occurring in RAG application development!

  • Challenge 4 - “My retriever is returning irrelevant documents”
  • Challenge 5 - “I don't trust the quality of the content produced by the RAG app”