maria_zervou

Introduction

Welcome to our technical blog series on the challenges encountered when building and deploying Retrieval-Augmented Generation (RAG) applications. RAG is a GenAI technique used to incorporate relevant data as context to a large language model (LLM) without the need to further train the LLM (known as “fine-tuning”). 

While RAG applications can transform vast amounts of raw data into insightful, context-rich answers, developers often face several hurdles that can impact the performance, security, and reliability of these applications.

In this series, we aim to unpack common challenges that you might encounter when implementing RAG solutions.

In this blog, we will explore the first three challenges. Part 2 of this series will detail challenges 4 and 5. 

By the end of this series, you should have a comprehensive understanding of the potential setbacks in RAG application development and practical strategies to overcome them, ensuring your applications are not only functional but also secure and efficient. 

Quick Primer - RAG Application Architecture

Before delving into specific challenges associated with building RAG applications, it's essential to understand the underlying architecture and how it operates.

A RAG application is a complex system that integrates many components to achieve production-level quality outputs. RAG shines in use cases where the relevant information is continuously evolving (e.g. customer order data), or where the model encounters unfamiliar data outside its training set. Unlike traditional prompt engineering, RAG can adapt to provide up-to-date and contextually correct outputs without the need to pre-train or fine-tune the model.

RAG operates by scanning a large corpus of documents to identify content relevant to a specific query. This is achieved through a vector database where queries and documents are converted into high-dimensional vectors. The similarity between these vectors determines the relevance of documents to the query.

When a user raises a query, the system retrieves the pertinent data from the vector database to provide context to the LLM. This relevant data is then used to augment the LLM prompt, ensuring the generated responses are tailored and accurate.

What can go wrong with RAG applications

Building a RAG application involves several critical steps:

  1. Embedding the source documents and the query.
  2. Conducting a similarity search to identify and retrieve relevant content.
  3. Integrating the query and the retrieved data to formulate the final prompt for response generation.

[Figure: RAG application architecture]

Users might encounter challenges anywhere in this process, so each step - including the operational aspects - requires careful consideration.
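The three steps above can be sketched end-to-end in a few lines. This is a toy illustration under stated assumptions: `embed` here is a simple bag-of-words count vector standing in for a real embedding model, and the prompt is assembled directly rather than sent to an LLM endpoint.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Steps 1-2: embed the query and rank documents by similarity."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(query: str, context: list[str]) -> str:
    """Step 3: augment the prompt with the retrieved context."""
    ctx = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{ctx}\n\nQuestion: {query}"

docs = [
    "Orders ship within two business days.",
    "Returns are accepted within 30 days of delivery.",
    "Our headquarters are located in Amsterdam.",
]
prompt = build_prompt("When do orders ship?", retrieve("When do orders ship?", docs))
```

In a production RAG app, the embedding calls go to a real model, the similarity search runs against a vector index, and the prompt is sent to the LLM; the shape of the pipeline stays the same.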

Step 0 - Establish the initial benchmark

Before troubleshooting and iterating on the RAG application by optimizing each component, it’s crucial to start with a benchmark to compare against - as is usually done for machine learning projects. Without a benchmark, you will not be able to evaluate if the solution you have built is performing well and if it is demonstrating business impact to your stakeholders. 

In order to build a benchmark for RAG, you should first deploy RAG components in the most basic setup, without tuning configurations. Once this benchmark is in place, you can iterate and monitor the quality of responses.
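A minimal benchmark harness can be as simple as a fixed evaluation set scored with a crude metric. The sketch below is illustrative: `keyword_recall` and the `baseline_app` stub are assumptions for the example (in practice you would use an LLM judge or human review), but the pattern of "freeze an eval set, score the baseline, compare every iteration against it" carries over.

```python
def keyword_recall(answer: str, expected_keywords: list[str]) -> float:
    """Crude quality metric: fraction of expected keywords present in the answer."""
    hits = sum(1 for kw in expected_keywords if kw.lower() in answer.lower())
    return hits / len(expected_keywords)

def run_benchmark(rag_app, eval_set: list[dict]) -> float:
    """Score a RAG app (a callable question -> answer) over a fixed eval set."""
    scores = [keyword_recall(rag_app(case["question"]), case["keywords"])
              for case in eval_set]
    return sum(scores) / len(scores)

eval_set = [
    {"question": "What is the return window?", "keywords": ["30 days"]},
    {"question": "Where do you ship from?", "keywords": ["Amsterdam"]},
]

def baseline_app(question: str) -> str:
    """Stub for the untuned, most basic RAG setup."""
    return "Returns are accepted within 30 days."

# Every later iteration (new chunking, new embedding model, ...) is scored
# with run_benchmark on the SAME eval_set and compared to this number.
baseline_score = run_benchmark(baseline_app, eval_set)
```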

Challenge 1 - “My RAG app is slow at generating an answer” 

Latency can be experienced at different steps of the RAG application. To help resolve this, you can optimize in the following places:

  • Consider tradeoffs between token limit and performance
  • Select a lower-dimensional embedding model
  • Select a more optimized algorithm for your Vector Search Index
  • Select the most appropriate base LLM model

Consider tradeoffs between token limit and performance

It’s important to note that there is a tradeoff between a higher token limit and performance: a model with a higher token limit can process more information at once, but that information takes longer to process.

You can select a model with a smaller token limit to improve performance. In order not to exceed the token limit, the documents can be split into smaller, meaningful parts. If splitting the document into meaningful small chunks is challenging, passing summaries can be a good alternative. Different chunking strategies can be adopted depending on the data at hand. It is also preferable to use shorter chunks when the RAG app answers short user queries. If you don't want to select a model with a smaller token limit, another strategy could be to reduce the number of documents retrieved when hitting the token limit.
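A fixed-size splitter with overlap is the simplest of the chunking strategies mentioned above. The sketch below counts words as a rough proxy for tokens; a real implementation would size chunks with the embedding model's own tokenizer.

```python
def chunk_text(text: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    """Split text into overlapping, fixed-size chunks.
    Words are a rough proxy for tokens; use the model's tokenizer in practice.
    The overlap keeps context that straddles a chunk boundary retrievable."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

doc = " ".join(f"word{i}" for i in range(250))
chunks = chunk_text(doc, chunk_size=100, overlap=20)
```

Smaller `chunk_size` values keep prompts short (helpful for short user queries and tight token limits); larger values preserve more context per chunk.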

The underlying technique of the model also has an impact on how tokens are managed. For example, the Mixture of Experts (MoE) approach used by DBRX, Databricks’ open LLM, helps manage token counts while maintaining performance. Because MoE models activate only a fraction of their parameters for each input, they are more efficient at inference than their total parameter counts would suggest. For instance, DBRX can generate text at up to 150 tokens per second per user, which is significantly faster than many dense models.

The MoE approach also helps to keep the token counts reasonable. DBRX uses the GPT-4 tokenizer, which is known to be especially token-efficient. This means that fewer tokens are necessary to reach the same model quality, which helps to keep the token counts manageable.

Select a lower-dimensional embedding model

An embedding model transforms both the input source text and the user query into vectors. When choosing an embedding model, selecting one with a larger number of dimensions isn't always the optimal choice.

As the dimensionality increases, more context is incorporated into the embedding, making it more expressive; however, complexity increases as well. The more complex the model, the slower and heavier the processing (which drives up costs!).

It's worth noting that the models with the most dimensions aren't necessarily the ones leading the MTEB leaderboard. When users search for a model on Hugging Face, it’s possible to filter by task to choose embedding models that are best for embedding similarity, as shown below. 

[Figure: Hugging Face model results filtered for Sentence Similarity]

 

Important!

It’s crucial to use the same embedding model for both the user's query and the source documents. Even though the embedding of source documents is a separate pre-processing action from the user question's embedding, the similarity search will not work unless the same embedding model is applied. In other words, the space where the documents are located and the space where the question is positioned must be identical.
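One way to see why this matters is to imagine two hypothetical embedding models that disagree on dimensionality: vectors from different models live in different spaces, so comparing them is at best an error and at worst silently meaningless. Both models below are made up for illustration.

```python
def model_a(text: str) -> list[float]:
    """Hypothetical 3-dim embedding model."""
    t = text.lower()
    return [float(t.count("order")), float(t.count("return")), float(len(t))]

def model_b(text: str) -> list[float]:
    """Hypothetical 2-dim embedding model: a different, incompatible space."""
    t = text.lower()
    return [float(len(t.split())), float(t.count("e"))]

def dot(u: list[float], v: list[float]) -> float:
    if len(u) != len(v):
        raise ValueError("vectors live in different embedding spaces")
    return sum(a * b for a, b in zip(u, v))

doc_vec = model_a("Returns are accepted within 30 days")  # documents pre-embedded with model_a
ok = dot(model_a("return policy"), doc_vec)               # same space: comparable
try:
    dot(model_b("return policy"), doc_vec)                # mixed spaces: fails
    mixed_ok = True
except ValueError:
    mixed_ok = False
```

Even when two models happen to share a dimension count, their axes mean different things, so the similarity scores would still be meaningless; matching dimensions is necessary but not sufficient.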


Select a more optimized algorithm for your Vector Index

Applications, especially those with real-time requirements, need query results to be returned as quickly as possible. However, if the vector database is large and not optimally indexed, the retriever may not be able to return relevant documents quickly enough.

Optimizing the data structure of the index for quick traversal can be achieved by using more advanced algorithms. Databricks Vector Search uses the Hierarchical Navigable Small World (HNSW) algorithm for its approximate nearest neighbour searches. This algorithm is used to identify the most similar vectors and return the associated documents.

Additionally, there are performance considerations when performing similarity searches. By default in Databricks, the similarity between vectors is measured using the L2 distance metric. If users want to use cosine similarity, they will need to normalize their datapoint embeddings before feeding them into Databricks Vector Search. When the data points are normalized, the ranking produced by L2 distance is the same as the ranking produced by cosine similarity.
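This equivalence is easy to verify: for unit vectors, ||u - v||² = 2 - 2·cos(u, v), so ascending L2 distance and descending cosine similarity produce the same ranking. The values below are arbitrary example vectors.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def l2(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cos(u, v):
    return sum(a * b for a, b in zip(u, v)) / (
        math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Normalize the document embeddings (and the query) before indexing.
query = normalize([0.2, 0.9, 0.4])
docs = [normalize(v) for v in ([1.0, 0.1, 0.0], [0.1, 1.0, 0.3], [0.4, 0.4, 0.9])]

rank_by_l2 = sorted(range(3), key=lambda i: l2(query, docs[i]))                   # ascending distance
rank_by_cos = sorted(range(3), key=lambda i: cos(query, docs[i]), reverse=True)   # descending similarity
```

With normalized embeddings the two rankings coincide, which is why normalizing data points before feeding them into an L2-based index effectively gives you cosine-similarity search.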

Select the most appropriate base LLM model

The choice of the base LLM is crucial, along with the choice of embedding model. Models vary across properties such as reasoning ability, propensity to hallucinate, context window size and serving cost. All the considerations highlighted in assessing the embedding model section above are also applicable when choosing LLMs. 

Similar to the embedding model, the larger the LLM, the heavier it is to serve (and, by extension, the higher the application latency). Choosing the best LLM means weighing model size against the task at hand.

Ultimately, the LLM should be chosen by considering the tradeoffs between model size and latency, together with the type of task that the RAG app is performing (e.g. Question Answering, Sentiment Analysis), as shown in the example below.

Finally, depending on the application being built, the more tokens included in the prompt, the longer the LLM will take to generate the answer.

[Figure: Hugging Face model results filtered for Question Answering]

 

Challenge 2 - “I cannot update and delete data from the vector store”

Vector libraries and databases are both solutions specialized in storing vectorized data, with integrated search capabilities. However, selecting the right store that meets your requirements and data needs is imperative.

Select the right vector store  

With vector libraries, the vector index must be rebuilt over all documents whenever any document is modified or deleted. A vector library is therefore only recommended when the data doesn’t change often.

On the other hand, vector databases are useful when users need database properties, have big data, and when changes in the index are likely to occur often. Vector databases allow users to update and delete data during the import process. 
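The difference can be illustrated with a toy in-memory store keyed by document id: updates and deletes touch only the affected entry, with no full index rebuild. The class and method names below are illustrative, not any vendor's API.

```python
class ToyVectorStore:
    """Toy in-memory vector store supporting upsert and delete by document id.
    Illustrative only -- a real vector database does this at scale with indexes."""

    def __init__(self):
        self._vectors = {}   # doc_id -> embedding
        self._payloads = {}  # doc_id -> original text

    def upsert(self, doc_id, vector, text):
        """Insert or update a single document without touching the others."""
        self._vectors[doc_id] = vector
        self._payloads[doc_id] = text

    def delete(self, doc_id):
        """Remove a single document without rebuilding the index."""
        self._vectors.pop(doc_id, None)
        self._payloads.pop(doc_id, None)

    def search(self, query_vec, k=1):
        def dot(u, v):
            return sum(a * b for a, b in zip(u, v))
        ranked = sorted(self._vectors,
                        key=lambda d: dot(query_vec, self._vectors[d]),
                        reverse=True)
        return [self._payloads[d] for d in ranked[:k]]

store = ToyVectorStore()
store.upsert("doc1", [1.0, 0.0], "Old shipping policy")
store.upsert("doc1", [1.0, 0.0], "New shipping policy")  # in-place update
store.upsert("doc2", [0.0, 1.0], "Returns policy")
store.delete("doc2")                                      # targeted delete
```

With a vector library, the equivalent of `upsert` or `delete` would force re-embedding and re-indexing the entire corpus.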

Databricks natively supports serving and indexing data for online retrieval. For unstructured data such as PDF documents, Vector Search (Databricks managed vector database) will automatically index and serve data from Delta tables (using the Delta Sync Index) and for structured data, Feature and Function Serving provides millisecond-scale queries of contextual data. 

Another option that can be considered when building RAG applications is leveraging your existing relational databases or search systems that offer vector search plugins. However, ensure they support all the functionalities needed.

Challenge 3 - “The app retrieves sensitive data that users should not have access to” 

It's absolutely pivotal for RAG applications to restrict access to sensitive data. These systems combine automated content generation with data retrieval from multiple sources, raising the risk of unauthorized data exposure. 

This is particularly important when building customer-facing RAG apps where it’s crucial to ensure that no sensitive data is inadvertently revealed to customers. Moreover, safeguarding sensitive data maintains the technology's credibility and trustworthiness, crucial for its broad acceptance and usage.

Select Unity Catalog for your governance and security framework   

Databricks natively supports serving and indexing data for online retrieval by enforcing access control settings between online and offline datasets through Unity Catalog. With UC integration, RAG workflows can be designed to retrieve only the documents that a user has access to. 

Unity Catalog also supports lineage-tracking allowing users to trace the flow of data through the RAG process for auditing purposes.
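At a conceptual level, the pattern is to filter documents by the requesting user's permissions *before* they are ranked and passed to the LLM. The sketch below mimics that behaviour with a per-document access-control list; it is an illustrative assumption, not Unity Catalog's actual API (Unity Catalog enforces this through its own access control settings).

```python
docs = [
    {"id": "d1", "text": "Public product FAQ", "allowed_groups": {"everyone"}},
    {"id": "d2", "text": "Internal pricing sheet", "allowed_groups": {"sales"}},
    {"id": "d3", "text": "HR salary bands", "allowed_groups": {"hr"}},
]

def retrieve_for_user(query: str, user_groups: set, corpus: list) -> list:
    """Drop documents the user cannot access BEFORE any similarity ranking,
    so sensitive text never reaches the prompt."""
    effective = user_groups | {"everyone"}
    visible = [d for d in corpus if d["allowed_groups"] & effective]
    # (similarity ranking over `visible` would happen here)
    return [d["text"] for d in visible]

customer_view = retrieve_for_user("pricing", {"customers"}, docs)
sales_view = retrieve_for_user("pricing", {"sales"}, docs)
```

Filtering after generation is not a safe substitute: once a restricted document has been injected into the prompt, its contents can leak into the answer.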

Conclusion

In Part 1 of this blog post, we explored three challenges that can occur when building RAG applications and some strategies to address these challenges:

Challenge 1 - “My RAG app is slow at generating an answer”: Many factors impact performance, including token limit tradeoffs, the choice of embedding model, the similarity metric, and the choice of LLM.

Challenge 2 - “I cannot update and delete data from the vector store”: If the solution doesn’t rely on a scalable vector store, users can face the challenge where data cannot be updated or deleted in their vector store.

Challenge 3 - “The app retrieves sensitive data that users should not have access to”: It’s important to safeguard the application with the right governance framework and security guardrails so that it doesn’t retrieve documents users should not have access to.

Coming up next

Stay tuned for Part 2 of this series where we’ll explore two other common challenges occurring in RAG application development!

  • Challenge 4 - “My retriever is returning irrelevant documents”
  • Challenge 5 - “I don't trust the quality of the content produced by the RAG app”