li_yu
Databricks Employee

Introduction 

What is RAG and its use cases

Many conversational and assistant AI technologies now rely on the advanced natural language capabilities of large language models (LLMs). However, these models only have access to the data on which they were trained, which limits their knowledge to general domains. The Retrieval Augmented Generation (RAG) architecture has emerged to empower LLMs with the dynamic, domain-specific, real-world data they need.

The RAG approach involves retrieving pertinent data or documents related to a specific question or task and incorporating them as contextual information for the LLM. RAG has demonstrated effectiveness in applications such as support chatbots and Q&A systems that require real-time information or access to domain-specific knowledge for optimal performance.

What is Feature Factory and how it can help

How Feature Factory scales up fine tuning.

Feature Factory is an open source accelerator that simplifies and unifies the feature engineering workflow and has been adopted by multiple enterprise ML/DS teams. It enables ML engineers to compute features efficiently at large scale. By decoupling the core framework from the feature definition catalogs, software engineering best practices can be conveniently applied to your feature definition lifecycle management.

Feature Factory can scale up the data preparation process of Retrieval Augmented Generation, as shown in the architecture diagram above. The Spark engine of Feature Factory distributes a set of documents across multiple partitions, and the partitions are processed in parallel by worker nodes in a Databricks compute/cluster on AWS or Azure. Documents can be parsed on a CPU cluster with a compute-optimized instance type; if parsing requires LLM tokenization and metadata extraction, the computation can run on Databricks GPU-enabled clusters. The processed DataFrame is saved as a table in the Databricks Lakehouse, and the resulting Delta tables populate indices in Vector Stores (e.g. Databricks Vector Search). These indices are then used to augment queries in RAG applications.

By distributing the data preparation to worker nodes, Feature Factory significantly reduces the compute time, enabling data scientists or engineers to experiment more efficiently with different approaches (e.g. different chunking strategies, metadata extraction, etc.). Results of different approaches can be saved as different Delta tables for downstream processing.
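To make the distribution concrete, below is a minimal sketch of the partition-parallel pattern that Feature Factory builds on. It is illustrative only (not Feature Factory internals); the document paths and the parse_and_chunk helper are hypothetical placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical input documents and partition count (e.g. one partition per worker node)
doc_paths = ["/tmp/docs/doc1.pdf", "/tmp/docs/doc2.pdf"]
partition_num = 2

def parse_and_chunk(path):
  # placeholder for real parsing/chunking (e.g. via LlamaIndex or LangChain)
  return ["chunk of " + path]

def process_partition(paths):
  # expensive resources (e.g. an LLM for metadata extraction) would be created once per partition here
  for p in paths:
    for chunk in parse_and_chunk(p):
      yield (p, chunk)

rdd = spark.sparkContext.parallelize(doc_paths, numSlices=partition_num)
chunks_df = rdd.mapPartitions(process_partition).toDF(["path", "chunks"])
chunks_df.write.mode("overwrite").saveAsTable("rag_chunks_demo")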

Large Language Model Ops (LLMOps) can also benefit from Feature Factory. The data preparation steps can be defined in a collection/catalog in a unified way, which makes it easier to track and version-control text data generation processes when combined with source control systems like GitHub.

 

Data Preparation

All LLM-related classes (e.g. parser, splitter, metadata extractor) are derived from an LLMTool abstract class, and each needs to implement a create and an apply method. The create method loads resource-intensive or model-specific objects (e.g. a large model) only when needed. For example, a metadata extractor can use the MPT-7B model to generate summaries: when the extractor is first instantiated, only the LLM parameters are passed in, and the MPT-7B model is created on the worker node's GPUs only when the create method is called for a data partition.

The apply method is where the processing logic for parsing, chunking or metadata extraction is implemented. It is invoked per partition and processes each document allocated to that partition.

The create-apply pattern described above enables developers to leverage any open source tools such as LlamaIndex and LangChain. 
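For reference, here is a minimal sketch of what the create/apply contract looks like. The shape of the base class is an assumption for illustration; the actual LLMTool shipped with Feature Factory may differ in its details.

from abc import ABC, abstractmethod

class LLMTool(ABC):
  def __init__(self):
    self._instance = None  # lazily created heavy resource (e.g. a model)

  @abstractmethod
  def create(self):
    """Load resource-intensive objects (e.g. a large model) on the worker node."""

  @abstractmethod
  def apply(self, doc):
    """Process a single document: parse, chunk or extract metadata."""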

In the code example below, an MPT-7B LLM is defined by subclassing LLMTool and then used to configure a title extractor based on the TitleExtractor from LlamaIndex:

 

 

import torch
from llama_index.llms import HuggingFaceLLM  # import path depends on the installed llama_index version

# LLMTool and LlamaIndexTitleExtractor are provided by Feature Factory
class MPT7b(LLMTool):
  def create(self):
    # release any cached GPU memory before loading the model on the worker node
    torch.cuda.empty_cache()
    generate_params = {
      "temperature": 1.0, 
      "top_p": 1.0, 
      "top_k": 50, 
      "use_cache": True, 
      "do_sample": True, 
      "eos_token_id": 0, 
      "pad_token_id": 0
    }

    llm = HuggingFaceLLM(
      max_new_tokens=256,
      generate_kwargs=generate_params,
      # system_prompt=system_prompt,
      # query_wrapper_prompt=query_wrapper_prompt,
      tokenizer_name="mosaicml/mpt-7b-instruct",
      model_name="mosaicml/mpt-7b-instruct",
      device_map="auto",
      tokenizer_kwargs={"max_length": 1024},
      model_kwargs={"torch_dtype": torch.float16, "trust_remote_code": True}
    )
    self._instance = llm
    return llm
  
  def apply(self):
    ...

TITLE_NODE_TEMPLATE = "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n### Instruction:\nGive a title that summarizes this paragraph: {context_str}.\n### Response:\n"

TITLE_COMBINE_TEMPLATE = "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n### Instruction:\nGive a title that summarizes the following: {context_str}.\n### Response:\n"

title_extractor = LlamaIndexTitleExtractor(
  nodes=5, 
  llm_def = MPT7b(),
  prompt_template = TITLE_NODE_TEMPLATE,
  combine_template = TITLE_COMBINE_TEMPLATE
)

 

 

The LLMFeature class defines a collection of tools to process documents. It consists of three components: a parser (reader), a splitter, and a metadata extractor (attached to the splitter). Here is an example of defining an LLMFeature.

 

 

doc_splitter = LlamaIndexDocSplitter(
  chunk_size = 1024,
  chunk_overlap = 32,
  extractors = [title_extractor]
)
llm_feature = LLMFeature(
  name = "chunks",
  reader = LlamaIndexDocReader(),
  splitter = doc_splitter
)

 

 

After the LLMFeature is defined, the assemble_llm_feature() method can be invoked on Feature Factory to generate the DataFrame of text chunks.

 

 

df = ff.assemble_llm_feature(
  spark,
  srcDirectory="a directory containing documents",
  llmFeature=llm_feature,
  partitionNum=partition_num
)

df.write.mode("overwrite").saveAsTable("<table name>")

 

 

In the example above, srcDirectory is the directory containing all input documents, and partitionNum is the number of Spark partitions used during computation: for example, if you have two GPU worker nodes, you can set partitionNum to 2 to distribute the documents across them. The output of the assemble method is a DataFrame with one column named chunks, i.e. the name of the LLMFeature. The DataFrame is then written out to a Delta table, ready for downstream indexing.

Feature Governance

Feature governance is one of the benefits of Feature Factory. Developers can organize feature definitions as collections/catalogs, which makes it easy to track and version-control the definitions across development teams. To make LLMFeature definitions easier to manage, an LLMCatalogBase class is provided as a container for defining all related classes and instances. The example below shows how to define an LLMFeature and its dependent variables inside the container class; the container class inherits from LLMCatalogBase, which defines how to retrieve an LLMFeature instance from its member variables.

 

 

class TestCatalog(LLMCatalogBase):

  # define a reader for the documents
  doc_reader = LlamaIndexDocReader()

  # define a text splitter
  doc_splitter = LangChainRecursiveCharacterTextSplitter(
    chunk_size=1024, 
    chunk_overlap=64
  )

  # define an LLM feature; the variable name becomes the column name in the result DataFrame
  chunk_col_name = LLMFeature(
    reader=doc_reader, 
    splitter=doc_splitter
  )
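As a usage sketch, the feature defined in the catalog can be passed to assemble_llm_feature like any standalone LLMFeature. The source directory, table name and partition count below are illustrative assumptions, and Feature Factory may also provide its own helper for retrieving the feature from the catalog.

df = ff.assemble_llm_feature(
  spark,
  srcDirectory="/tmp/docs",                # hypothetical document directory
  llmFeature=TestCatalog.chunk_col_name,   # the feature defined in the catalog above
  partitionNum=2
)
df.write.mode("overwrite").saveAsTable("test_catalog_chunks")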

 

 

Document Parser

Existing Python libraries such as LangChain and LlamaIndex can be integrated with Feature Factory by implementing the create and apply methods of LLMTool. The currently integrated parsers are LlamaIndexDocReader (based upon SimpleDirectoryReader) and UnstructuredDocReader (using the unstructured APIs).
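New readers can be added in the same way. Below is a hedged sketch of a custom reader; the exact apply signature (a single file path in, a list of parsed documents out) is an assumption about the reader contract, not confirmed by the source.

from llama_index import SimpleDirectoryReader  # import path depends on the installed llama_index version

class MyDirectoryDocReader(LLMTool):
  def create(self):
    return None  # no heavy resources needed for plain parsing

  def apply(self, filename):
    # parse a single document into LlamaIndex Document objects
    return SimpleDirectoryReader(input_files=[filename]).load_data()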

Document Splitter

Chunking strategies are also an important component in building a successful RAG system. The current implementation of doc splitters supports the SimpleNodeParser of LlamaIndex, the RecursiveCharacterTextSplitter of LangChain, and a custom tokenizer-based splitter (TokenizerTextSpliter). Like doc readers, the splitter classes can be extended by subclassing DocSplitter. Please note that metadata extraction is supported for the SimpleNodeParser, and an LLM instance needs to be created for the metadata extraction. The LLM definition needs to subclass LLMTool and override the create method. An example of an LLM definition can be found at: LLM notebook.

Metadata Extraction

Chunking strategies break the text corpora into small pieces, which results in loss of context. One approach to mitigating this issue is to include document metadata (e.g. summaries) as context for a query.

Document metadata can be extracted using the metadata extractors of LlamaIndex, which use OpenAI or Hugging Face models to extract metadata such as summaries, keywords and titles from text chunks. In Feature Factory, an LLM can be defined by subclassing LLMTool; the first code example above shows how to define an LLM and utilize it for metadata extraction.

Feature Factory also provides a method to extract metadata from file paths. For example, if your documents are stored in directories organized by year and the directories are named year=[actual year], the year can be extracted as metadata. If a document has the path /tmp/year_of_publication=2023/doc1, then after splitting, each chunk from that document will have year of publication: 2023 as part of the chunk's header.
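A minimal, illustrative-only sketch of that behavior is shown below; it mimics the described key=value extraction and is not Feature Factory's internal implementation.

def metadata_from_path(path: str) -> dict:
  # collect key=value segments from the file path as chunk metadata
  meta = {}
  for segment in path.split("/"):
    if "=" in segment:
      key, value = segment.split("=", 1)
      meta[key.replace("_", " ")] = value
  return meta

print(metadata_from_path("/tmp/year_of_publication=2023/doc1"))
# {'year of publication': '2023'}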

 

Integration with Vector Store

The DataFrame generated by the LLM tools above can be saved as a Delta table, which can then serve as the source for building indices in a Vector Store. The example below shows how text chunks can be read from a Delta table and used to populate a search index in Chroma DB.

 

 

from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.docstore.document import Document as LCDocument

# read the chunk table produced by Feature Factory and convert it to LangChain documents
parsed_df = spark.table("<your table>")
parsed_pdf = parsed_df.toPandas()
docs = [LCDocument(page_content=chunk, metadata={"id": idx}) for idx, chunk in parsed_pdf["chunks"].items()]

chroma_root = "/dbfs/tmp/llms/Chroma_db"

hf_embedding = HuggingFaceEmbeddings(
  model_name="sentence-transformers/all-mpnet-base-v2",
  model_kwargs={"device": "cpu"},
  encode_kwargs={"normalize_embeddings": False}
)

vector_storage = Chroma.from_documents(
  collection_name="ff_collection",
  documents=docs,
  embedding=hf_embedding,
  persist_directory=chroma_root,
)

vector_storage.similarity_search_with_score(query="your prompt", k=3)

 

 

The text processing methods in Feature Factory are agnostic to the Vector Store. A new vector store can easily be integrated with a minor data format transformation between the Delta table and the preferred data format of the selected vector store.
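For example, the sketch below swaps in a LangChain FAISS index while keeping the Delta-table-to-documents step unchanged. It reuses docs and hf_embedding from the Chroma example and assumes the faiss package is installed.

from langchain.vectorstores import FAISS

# build an in-memory FAISS index from the same LangChain documents and embeddings
faiss_index = FAISS.from_documents(documents=docs, embedding=hf_embedding)
faiss_index.similarity_search(query="your prompt", k=3)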

 

Conclusion 

In this blog, we introduced a scalable approach to processing documents with Feature Factory. The approach is both scalable and extensible: developers can leverage functions from LlamaIndex and LangChain, or define their own via the create-apply pattern. The source code is available to try out for the notebook and the LLM tools.