This post is written by Ellen Hirt, Senior Specialist Solutions Engineer, and Pascal Vogel, Solutions Architect.

Over the past year, Databricks has supported many teams with building Retrieval Augmented Generation (RAG) applications. We've noticed that they often face similar challenges and questions. In this blog post series, we share practical tips and best practices from our experience to help you build high-quality RAG applications on Databricks.

We provide hands-on advice for each pipeline step to help you tackle common challenges.

Introduction

RAG has emerged as an essential architecture pattern for generative AI applications. It incorporates up-to-date and private organizational knowledge for improved relevance and understanding, grounds LLM responses to minimize hallucination, and avoids costly fine-tuning or pretraining of custom LLMs.

[Image: Overview of the three high-level stages of RAG]

RAG encompasses three high-level stages, each offering opportunities for quality optimization:

  1. Data preparation: Ingesting, cleaning, and structuring diverse data types to build a knowledge base ready for retrieval.
  2. Retrieval: Retrieving relevant chunks of indexed information in response to a user query.
  3. Augmentation and generation: Augmenting the user prompt with retrieved information and generating a relevant and accurate response.

This series emphasizes RAG quality, referring to relevant and accurate responses from a RAG application. Other factors, such as cost or response latency, are not our primary focus and are covered extensively in existing Databricks blog posts.

If you are new to RAG on Databricks, take a look at the Databricks Generative AI Cookbook and Build High-Quality RAG Apps with Mosaic AI hands-on demo. These resources serve as a starting point for implementing the recommendations we introduce in this article.

Any RAG application development should be accompanied by a thorough evaluation to determine if changes have a lasting positive impact on quality. Follow our in-depth RAG evaluation tutorial to learn how to use Databricks tooling to evaluate RAG application performance (e.g., end-to-end latency, token counts) and quality (e.g., retrieval quality, correctness, groundedness).

Data preparation: laying a foundation for high-quality RAG

As with traditional machine learning, the “garbage in, garbage out” principle applies to RAG systems. Because focusing on data preparation can significantly improve RAG quality, we start our blog series with this critical and challenging stage.

In the following sections, we walk through the data preparation process depicted in the image below.

[Image: The RAG data preparation process - ingestion, parsing, chunking, metadata extraction, embedding, and indexing]

1. Ingestion

RAG quality considerations begin at the data ingestion stage in Databricks, where data is first brought into the platform. Ingestion is not a one-off process, and the recommended patterns depend primarily on the type of data being ingested and how frequently that data is updated. In RAG applications, data is often dynamic, requiring periodic updates, re-indexing, or even real-time syncing to ensure the most current information is available for retrieval and generation.

Frequently used data types for RAG can range from PDF documents to (semi-)structured data such as CSVs or JSON documents, all of which you initially store in Unity Catalog volumes.

Considerations to keep in mind during this stage of RAG data preparation from a quality perspective include:

  • Ensure you preserve file metadata. Metadata filtering is critical for effective retrieval, and metadata needs to be extracted during data preparation.
  • Handle updates in source files: ensure that up-to-date information is available to your RAG application. Use Databricks Workflows to schedule jobs that are triggered when new files arrive, use LakeFlow Connect for managed ingestion from various data sources, or use Auto Loader to detect changes in a source file directory (see the sketch after this list).
  • If your application depends on multiple source systems, you may need to set up multiple ingestion pipelines for different types of documents.
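
The sketch below shows one way to set up incremental ingestion of source documents with Auto Loader into a Delta table governed by Unity Catalog. The volume path and table name are placeholders; adapt them to your environment.

```python
# A minimal Auto Loader sketch (assumed volume path and table name) that incrementally
# ingests new source files from a Unity Catalog volume into a Delta table.
from pyspark.sql import functions as F

raw_docs = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "binaryFile")            # ingest PDFs, HTML, etc. as binary content
    .load("/Volumes/main/rag/source_docs")                 # hypothetical UC volume with source documents
)

(
    raw_docs
    .withColumn("ingested_at", F.current_timestamp())      # simple ingestion metadata
    .writeStream
    .option("checkpointLocation", "/Volumes/main/rag/_checkpoints/raw_docs")
    .trigger(availableNow=True)                             # process all new files, then stop; schedule via Workflows
    .toTable("main.rag.raw_documents")                      # hypothetical UC target table
)
```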

2. Parsing

After identifying your data sources and ingesting data into Databricks, the next step is to parse content from your documents so that your RAG application can read it effectively. As with ingestion, there is no one-size-fits-all approach to parsing: different types of documents present different challenges, particularly in preserving structure and ensuring accurate data extraction.

One consideration during parsing is whether to perform naive extraction, which results in a simple flat text output, or to preserve the structure of the original documents by retaining sections, sub-sections, paragraphs, and formatting. Structure-preserving parsing is generally preferable, as it maintains the context and relationships within the document, which in turn enhances the quality of your RAG application.

In addition to text and its structure, your documents may also contain tables, diagrams, and images that hold key information. Naively parsed tables can lead to misinterpretation by LLMs, and omitting diagrams and images during naive parsing can also result in the loss of critical information.

Parsing best practices

Data cleaning

Pre-process your data to eliminate noise and reduce malformed information. This includes removing or normalizing special characters and handling structural elements like paragraphs, headers, and footers. More advanced operations can consist of entity deduplication, spelling and grammar correction, handling redundancies, data normalization, and applying custom cleaning rules tailored to your documents.

Error and exception handling

Implement error handling and logging to identify and resolve problems during parsing. In Databricks, you can leverage the logging capabilities of jobs and Delta Live Tables to do so.

Customize parsing logic

To handle your documents’ custom format and preserve structure, tables, and images, you may need to implement logic using specialized libraries available in Databricks. These include, among others, Unstructured, PyPDF2, LangChain, HTML parsing libraries like BeautifulSoup, OCR tools, and image recognition models and APIs.

You can also leverage proprietary tools; our RAG cookbook provides an extensive overview. Furthermore, you may need to add domain-specific rules, use regular expressions, or combine different tools and techniques. The choice of tools will depend on the type of documents you are working with, such as HTML, PDF, or scanned images. Multi-modal LLMs can generate text descriptions of images and enable a multi-modal RAG approach, but this is not the focus of this article.
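
As an illustration, the following sketch uses BeautifulSoup to parse an HTML document while preserving its header structure, so each extracted section keeps its context. The function and section layout are illustrative, not a complete parser.

```python
# A hedged sketch of structure-preserving HTML parsing with BeautifulSoup.
# The section layout is illustrative; real documents may need additional rules.
from bs4 import BeautifulSoup

def parse_html_sections(html: str) -> list[dict]:
    """Split an HTML document into sections keyed by their nearest header."""
    soup = BeautifulSoup(html, "html.parser")
    sections, current = [], {"header": None, "text": []}
    for el in soup.find_all(["h1", "h2", "h3", "p", "li", "td"]):
        if el.name in ("h1", "h2", "h3"):
            if current["text"]:
                sections.append(current)                    # close the previous section
            current = {"header": el.get_text(strip=True), "text": []}
        else:
            current["text"].append(el.get_text(strip=True))
    if current["text"]:
        sections.append(current)
    return [{"section": s["header"], "content": " ".join(s["text"])} for s in sections]
```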

Check the output quality

To ensure output quality, you should visually inspect processed documents. Check parsing speed as well as the preservation of document structure, the accuracy of table and image extraction, and consistency across different document types. Additionally, assess your pipeline's ability to handle complex formats and edge cases.

Ensure scalability

Ensure that the data parsing solution is scalable and efficiently handles large volumes of documents as your application grows. On Databricks, you can build scalable data ingestion pipelines using tools such as Databricks Workflows, Auto Loader, and file triggers. Delta Lake provides highly scalable storage for source files and parsed documents alike.

Regardless of the document source, parsed document content should be stored in Delta tables governed by Unity Catalog in preparation for the next data preparation steps. By keeping the parsed, and later chunked, data in Unity Catalog managed tables, you can apply fine-grained permissions, track lineage, and openly share this source data.
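
A minimal sketch of landing parsed content in a Unity Catalog managed table could look as follows; catalog, schema, and column names are assumptions, and `parsed_sections` stands in for the output of a parsing function like the one sketched above.

```python
# Store parsed document sections in a Unity Catalog managed Delta table
# (hypothetical catalog/schema/table and column names).
from pyspark.sql import Row

rows = [
    Row(doc_path=path, section=s["section"], content=s["content"])
    for path, sections in parsed_sections.items()          # {file path: list of parsed sections}
    for s in sections
]

spark.createDataFrame(rows).write.mode("append").saveAsTable("main.rag.parsed_documents")
```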

3. Chunking

Chunking plays a critical role in the effectiveness of your RAG application, as it determines the size and relevancy of context passed to an LLM. While many LLMs have generous context windows, filling these up with chunks is usually not the best solution and may lead to the “needle in a haystack” problem of declining retrieval performance in large-context LLM calls. Furthermore, you need to leave some space for instructions in your prompt.

An ideal chunking approach depends on your documents and requires some experimentation to find what works best. A solid chunking strategy can elevate the quality of RAG responses.

Chunking considerations

Before you start chunking, you should evaluate your documents:

  • Are they long articles or shorter items like individual messages? Larger chunks may contain multiple topics, whereas shorter chunks are inherently more focused.
  • What is their structure - consistent, nested, or fragmented across documents? Is there any inherent structure that you should leverage?
  • How frequently do you plan to add new documents? What size and number of documents are you expecting? Highly frequent updates of document chunks may impact ingestion performance.
  • How do documents relate to one another?
  • Do you know how users will query the application, and can you expect long queries? Consider the chunk size relative to the query size: since these two vectors are compared, this influences the similarity score during retrieval.
  • Which embedding model do you plan to use in the next step? If you plan to use, e.g., a sentence-transformer embedding model, shorter chunks may help the model focus on more specific and coherent information.

Chunking strategies and parameters

Different chunking strategies with varying levels of complexity exist:

  • The most straightforward chunking strategies are character- or token-based: they simply divide text into smaller units without understanding the relationships between units, e.g., by splitting on “.” or “\n”.
  • Context-aware chunking provides more meaningful and accurate segmentation by considering the surrounding context. Some examples of this strategy include:
    • Sentence-based: Whereas character-based chunking may split up sentences, sentence-based chunking keeps punctuation between sentences in mind, thus preserving more grammatical structure. 
    • Recursive: This involves breaking down chunks into smaller sub-chunks in a hierarchical manner, reflecting the content's nested structure. A well-known example is the LangChain RecursiveCharacterTextSplitter.
    • Specialized: This can include, e.g., taking into account the present structure of a document, such as splitting on headers for an HTML page or other domain-specific rules.
    • Semantic: A more recent approach, semantic chunking considers the cohesiveness within the document, dividing the text into semantically coherent, complete chunks. It may, however, be slower than other methods.

Furthermore, documents with hierarchical structures or connected topics across different documents might benefit from knowledge graphs or more advanced chunking methods. While these advanced chunking methods are not the focus of this article, some of them include parent-child or hierarchical chunking, tree-based chunking (e.g., RAPTOR), or leveraging knowledge graphs (GraphRAG).

Chunking starting points

Although the exact parameters depend on the chunking method you choose, chunk size and chunk overlap are common parameters (e.g., for recursive chunking). Their ideal values vary strongly with your documents and the embedding model you plan to use, but you have to start somewhere.

If you cannot make a concrete determination for chunk size based on your documents, a typical chunk size starting point is between 200 and 500 tokens. A chunk with 500 tokens will cover about one page of text. For overlap, start, for example, with 10% of the chunk size, in this case, 50 tokens. Don’t get fixated on these numbers; instead, start iteratively evaluating both higher and lower chunk sizes and overlaps to find out what works for your use case. Furthermore, keep in mind the natural structure of your documents, such as their paragraph sizes.
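
As a concrete, hedged starting point, the following sketch applies LangChain's RecursiveCharacterTextSplitter with token-based sizing and the values discussed above; the tokenizer choice is an assumption and should roughly match your embedding model.

```python
# Token-based recursive chunking with LangChain, using the starting values above
# (500-token chunks, 50-token overlap). Tokenizer choice is an assumption.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""],   # prefer paragraph and sentence boundaries
)

chunks = splitter.split_text(parsed_document_text)   # parsed_document_text: output of the parsing step
```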

4. Metadata extraction

Metadata is supplemental data stored alongside your indexed documents in a vector store. Leveraging metadata effectively can have a significant impact on the quality, performance, and security of your RAG application.

Mosaic AI Vector Search supports metadata filtering, which allows you to include or exclude document chunks based on their respective metadata and filter conditions. Only included chunks are considered for similarity search. Excluding irrelevant chunks prevents them from skewing the results of the similarity search.

Metadata filtering can also reduce query runtime by reducing the set of chunks whose similarity is compared. Finally, metadata can be used to apply access controls, for instance, by filtering documents based on tags in case they contain PII.
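
To make this concrete, here is a hedged sketch of a filtered similarity search with the Mosaic AI Vector Search Python SDK; the endpoint, index, and column names are assumptions, and the next part of this series covers retrieval-time filtering in more depth.

```python
# Similarity search restricted by a metadata filter (assumed endpoint, index, and columns).
from databricks.vector_search.client import VectorSearchClient

vsc = VectorSearchClient()   # picks up workspace authentication in a Databricks notebook

index = vsc.get_index(
    endpoint_name="rag_vs_endpoint",
    index_name="main.rag.docs_index",
)

results = index.similarity_search(
    query_text="How do I configure single sign-on?",
    columns=["chunk_id", "content", "category"],
    filters={"category": "admin_guide"},   # only chunks with this metadata value are considered
    num_results=5,
)
```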

Metadata examples

What metadata to store with your document chunks is highly dependent on your use case. To decide which metadata to extract, consider what fields your users will use in their queries. Are they likely to ask about entities that you can include in tags, such as product or model names? Will they query based on years or date ranges?

Take a look at the following generic examples for some inspiration:

  • File: file name, file size, source format
  • Document: document title, author, version, description, comment, type, language
  • Dates: created, indexed, effective date, expiration date
  • Security: PII, GDPR, HIPAA
  • Categorization: category, tags, keywords, taxonomy

Extraction approaches

Some metadata, such as file size or source format, can be extracted directly during ingestion and parsing. Other data, such as document author or language, may be retrieved from the source system that stores the document (e.g., a document management system or internal wiki).

In cases where documents follow a repeatable structure, you can rely on automated extraction methods. Typical examples include standard reports, legal and regulatory documents, or invoices. Regular expressions can be used to extract simple patterns like dates or email addresses. Heuristic rules can be applied to parse HTML or XML documents for metadata, e.g., to extract keywords, dates, or authors from their respective HTML tags with Beautiful Soup.
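
The following sketch illustrates such rule-based extraction with regular expressions and Beautiful Soup; the patterns and tags are examples rather than a complete solution.

```python
# Rule-based metadata extraction with regular expressions and BeautifulSoup
# (illustrative patterns and meta tags only).
import re
from bs4 import BeautifulSoup

def extract_basic_metadata(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    text = soup.get_text(" ", strip=True)

    author_tag = soup.find("meta", attrs={"name": "author"})
    dates = re.findall(r"\b\d{4}-\d{2}-\d{2}\b", text)        # ISO-style dates such as 2024-11-06
    emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", text)

    return {
        "title": soup.title.get_text(strip=True) if soup.title else None,
        "author": author_tag["content"] if author_tag else None,
        "dates": sorted(set(dates)),
        "emails": sorted(set(emails)),
    }
```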

Traditional natural language processing techniques such as entity recognition, text classification, topic modeling, or sentiment analysis can all provide automated input for metadata fields. For instance, NER models such as GLiNER or the bert-NER family of models can cost-effectively generate keywords or tags.

Finally, you can leverage LLMs to extract metadata such as document titles, summaries, keywords, and other entities from document chunks. Using function calling on Databricks ensures structured output from an LLM that can be reliably fed into metadata fields. Libraries like LangChain and LlamaIndex also feature pre-built extractors that you can integrate into your document ingestion pipelines.
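
As a hedged sketch, the example below uses the OpenAI-compatible client against a Databricks Foundation Model API endpoint with function calling to return structured metadata; the workspace URL, token handling, model endpoint name, and schema are assumptions.

```python
# LLM-based metadata extraction via function calling on a Foundation Model API endpoint.
# Workspace URL, token handling, endpoint name, and schema are assumptions.
import json
from openai import OpenAI

client = OpenAI(
    api_key="<DATABRICKS_TOKEN>",                              # e.g., a personal access token
    base_url="https://<workspace-host>/serving-endpoints",
)

metadata_tool = {
    "type": "function",
    "function": {
        "name": "record_document_metadata",
        "description": "Record metadata extracted from a document chunk.",
        "parameters": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "summary": {"type": "string"},
                "keywords": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["title", "summary", "keywords"],
        },
    },
}

# chunk_text: a document chunk produced in the previous steps
response = client.chat.completions.create(
    model="databricks-meta-llama-3-3-70b-instruct",            # any tool-calling-capable endpoint
    messages=[{"role": "user", "content": f"Extract metadata from this chunk:\n{chunk_text}"}],
    tools=[metadata_tool],
    tool_choice={"type": "function", "function": {"name": "record_document_metadata"}},
)

metadata = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
```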

Storing metadata

In Mosaic AI Vector Search, extracted metadata is stored in columns side-by-side with the document chunk and vector embedding. The following example table contains document chunks in the content column, vector embeddings in the __db_content_vector column, and has, for instance, category as a filterable metadata column:

[Image: Example vector search index table with content, __db_content_vector, and category columns]

In the next part of this blog series, we will cover how you can extract metadata fields from user queries and use them to apply metadata filtering as part of vector search queries.

5. Embedding

Embedding models are used to transform the text that makes up document chunks into dense vectors or vector embeddings. These vectors represent the semantic meaning of the text in a continuous vector space. Vector embeddings are the basis for vector search, enabling the retrieval of semantically similar documents based on the proximity of their vector representations rather than exact keyword matching.

Selecting a suitable embedding model is important for the quality of your RAG system's results. It affects the proximity of related chunks and also influences performance and cost-effectiveness.

When picking an embedding model for your RAG use case, you should consider a number of factors.

First, you should choose a model that has been pre-trained on data similar to your documents. This similarity should encompass not only the language of your documents but also their topical content, ensuring the model accurately captures the semantic meaning.

Some embedding models are trained on data in a single language, such as BGE-Large-EN, which is trained on English language data. Other models are trained on a broader set of languages and can be used for multilingual applications, like BGE-M3.

Depending on the topical content of your documents and the presence of domain-specific terms, you may achieve better results by relying on specialized embedding models. For example, BioBERT has been trained on biomedical data such as PubMed abstracts and papers, whereas LegalBERT has been trained on legal texts, contracts, and court decisions.

If you can’t find an existing model that offers satisfactory performance for your source documents, you can use Mosaic AI Training to fine-tune a custom embedding model tailored to your domain.

Understanding the inner workings of your embeddings is crucial

An embedding model transforms a piece of text into a vector with a certain number of dimensions. For instance, the BGE-Large-En model maps text to a vector with 1,024 dimensions. Choosing a model with higher dimensionality can allow the embedding to capture more nuanced information about the contextual meaning of the text.

However, these embeddings may also capture more noise and do not automatically lead to better results. High-dimensionality models also require more compute resources to generate embeddings, which in turn impacts cost. Mosaic AI Vector Search can store embeddings with up to 4,096 dimensions.

Besides dimensionality, the length of the embedding model context window is a key consideration. For instance, BGE-Large-En supports a context window of up to 512 tokens, while GTE-Large-En supports up to 8,192 tokens. A longer context window can capture semantic meaning across multiple sentences and even pages of text in a single vector embedding. Determining the ideal context window length depends largely on the token length of document chunks and user queries.

If the context window is too short, the embedding may miss context, which is essential for accurate semantic representation. On the other hand, while a longer context window can capture more semantic connections, it also introduces challenges such as increased computational requirements, cost, latency, and potential dilution of relevance by including too much (irrelevant) information. If your text is short but the model has a larger context window, padding may be used for unfilled slots.
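
A quick, hedged way to sanity-check whether your chunks fit a candidate model's context window is to tokenize them with that model's tokenizer; the model ID below is one possible choice, not a recommendation.

```python
# Check chunk token lengths against an embedding model's context window
# (model ID is an assumed example; use the tokenizer of the model you select).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-large-en-v1.5")

token_lengths = [len(tokenizer.encode(chunk)) for chunk in chunks]   # chunks from the chunking step
print(f"max: {max(token_lengths)}, mean: {sum(token_lengths) / len(token_lengths):.0f}")
print(f"chunks exceeding 512 tokens: {sum(length > 512 for length in token_lengths)}")
```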

How to find the right embedding model for your use case

The Massive Text Embedding Benchmark (MTEB) leaderboard compares both open-source and proprietary embedding models based on a number of attributes and benchmarks their performance on eight representative tasks. Filter the leaderboard to narrow down the relevant models for your use case. Some attributes, such as the supported context window or language support, may be easy to determine based on factors such as your defined chunk size and source documents. Other choices, such as the right model size, are easier to determine through real-world evaluation.

Building an effective and representative evaluation data set for your RAG application and benchmarking different embedding models using Mosaic AI Agent Evaluation can quickly demonstrate which embedding model is best suited for your use case.

On Databricks, you have a variety of options for deploying embedding models. Use the pay-per-token Foundation Model APIs to experiment with embedding models such as BGE Large (En) or GTE Large (En). For production use cases, use the provisioned-throughput Foundation Model APIs.
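
For example, the pay-per-token endpoints can be called with the MLflow Deployments client as sketched below; the endpoint name refers to the Databricks-hosted GTE model and can be swapped for whichever embedding endpoint you choose.

```python
# Generate an embedding via a pay-per-token Foundation Model API endpoint.
from mlflow.deployments import get_deploy_client

deploy_client = get_deploy_client("databricks")

response = deploy_client.predict(
    endpoint="databricks-gte-large-en",
    inputs={"input": ["What is Retrieval Augmented Generation?"]},
)
embedding = response["data"][0]["embedding"]   # a 1,024-dimensional vector for GTE Large (En)
```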

If you want to deploy an open-source embedding model that is not available via Foundation Model APIs, you can deploy it as a custom model using Mosaic AI Model Serving. Finally, to use an embedding model hosted outside of Databricks, such as OpenAI's proprietary embedding models, consider external models in Mosaic AI Model Serving.

When using Mosaic AI Vector Search, you can choose between several options for providing vector embeddings, including a managed option in which Vector Search computes embeddings for you (using an embedding model of your choice) as part of ingesting chunked data into a vector search index.

You should use the same embedding model for embedding document chunks and the user query to ensure both are mapped to a common vector space, enabling accurate similarity comparisons.

6. Indexing

Once your data is parsed, chunked, and embedded, and the corresponding metadata is defined, it needs to be ingested into a Vector Search index.

With Mosaic AI Vector Search, ingesting data into a vector index from a Delta table and keeping the index in sync with the source can be fully automated. Learn more about the options for indexing data in the Vector Search documentation.
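
As a hedged sketch, the following creates a Delta Sync index with managed embeddings, where Vector Search computes embeddings from the chunked source table; the endpoint, table, and column names are assumptions.

```python
# Create a Delta Sync index with managed embeddings (assumed endpoint, table, and column names).
# The source Delta table needs Change Data Feed enabled.
from databricks.vector_search.client import VectorSearchClient

vsc = VectorSearchClient()

index = vsc.create_delta_sync_index(
    endpoint_name="rag_vs_endpoint",                           # an existing Vector Search endpoint
    index_name="main.rag.docs_index",
    source_table_name="main.rag.parsed_chunks",                 # Delta table with chunked documents and metadata
    pipeline_type="TRIGGERED",                                   # sync on demand; CONTINUOUS for near real-time
    primary_key="chunk_id",
    embedding_source_column="content",                           # text column that Vector Search embeds for you
    embedding_model_endpoint_name="databricks-gte-large-en",
)
```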

Mosaic AI Vector Search uses the Hierarchical Navigable Small World (HNSW) algorithm to perform Approximate Nearest Neighbor (ANN) searches, identifying the most relevant document chunks based on their vector embeddings. Index optimization is fully managed by Databricks, and no manual optimizations are necessary.

To separate different knowledge domains, consider creating multiple vector search indexes. During retrieval, you can then implement logic for routing queries to the appropriate index based on extracted metadata or query characteristics. By eliminating entire knowledge domains from the search space through index routing, you can improve both the efficiency and the quality of responses, especially in complex, multi-domain environments.

Summary and conclusion

Building a high-quality RAG application requires careful attention to each stage of the data pipeline. By focusing on the proper ingestion, parsing, chunking, embedding, and metadata extraction techniques, you can improve the accuracy and relevance of your RAG system's responses.

If you’re new to RAG, check out the Databricks Generative AI Cookbook and Build High-Quality RAG Apps with Mosaic AI hands-on demo as starting points. Next, dive deeper into each stage of the data pipeline and experiment with the techniques and tools outlined here.

Any optimizations you attempt should be accompanied by a thorough evaluation to determine if they have a lasting positive impact on quality. Follow our in-depth RAG evaluation tutorial to learn more.

Data preparation is the first of three stages of RAG. In the next post, we'll dive deeper into optimizations in the retrieval stage.