Databricks Vector Search is a vector database that is built into the Databricks Data Intelligence Platform and integrated with its governance and productivity tools. A vector database is optimized to store and retrieve embeddings. Embeddings are mathematical representations of the semantic content of data, typically text or image data. Embeddings are generated by a Large Language Model (LLM) and are a key component of many GenAI applications that depend on finding documents or images that are similar to each other. Examples are Retrieval Augmented Generation (RAG) systems, recommender systems, and image and video recognition.
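To make "similar" concrete: vector databases typically rank items by the distance between their embedding vectors, commonly cosine similarity. Here is a minimal Python illustration; the three-dimensional vectors are toy values for demonstration, not real LLM embeddings (which typically have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors:
    close to 1.0 means semantically similar, close to 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings: the first two point in similar directions.
doc_cat = [0.9, 0.1, 0.0]
doc_kitten = [0.8, 0.2, 0.1]
doc_invoice = [0.0, 0.1, 0.9]

# A query embedded near "cat" retrieves "kitten" before "invoice".
assert cosine_similarity(doc_cat, doc_kitten) > cosine_similarity(doc_cat, doc_invoice)
```

A vector index is, at heart, a data structure that answers "which stored vectors score highest against this query vector" efficiently at scale.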
For more details on how to create a Vector Search index, how it works, and how similarity search is performed against it, refer to the Databricks Vector Search documentation.
There are two types of Vector Search indexes on the Databricks platform: Delta Sync Index and Direct Vector Access Index.
We recommend using a Delta Sync Index if your use case supports it. A Delta Sync Index provides an easy-to-use, automatic, managed ETL pipeline that keeps the vector index up to date, and it can use Databricks-managed embedding computation. However, if you need more flexibility to perform CRUD (create, read, update, and delete) operations on the data in your vector index, or you have already computed embeddings with a self-managed embedding model and they are ready to be ingested into the index, you would want to use a Direct Vector Access Index.
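For reference, creating a Direct Vector Access Index with the `databricks-vectorsearch` Python SDK (`VectorSearchClient().create_direct_access_index(...)`) takes roughly the following parameters. The endpoint name, index name, and schema below are illustrative placeholders, so treat this as a sketch to check against the SDK documentation:

```python
# Parameters you would pass to create_direct_access_index(...) in the
# databricks-vectorsearch SDK; all values here are illustrative placeholders.
direct_index_config = {
    "endpoint_name": "vs_endpoint",              # an existing Vector Search endpoint
    "index_name": "main.rag.docs_direct_index",  # Unity Catalog three-level name
    "primary_key": "chunk_id",
    "embedding_dimension": 1024,                 # BGE Large (English) emits 1024-dim vectors
    "embedding_vector_column": "embedding",
    "schema": {
        "chunk_id": "string",
        "doc_id": "string",
        "text": "string",
        "embedding": "array<float>",
    },
}
```

Note that, unlike a Delta Sync Index, you declare the schema and embedding column yourself, because you own both the data and the embedding computation.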
Unlike a Delta Sync Index, which uses managed DLT to automatically and incrementally update the index as the underlying Delta table changes, a Direct Vector Access Index has no built-in syncing process: you must build and manage your own pipeline to keep its data fresh. Data freshness is imperative for any trustworthy database, and a vector index is no exception. In this blog I will build an ETL pipeline that keeps a Direct Vector Access Index fresh in near real time as new documents are ingested from the source. I will also use this index to deploy a real-time Q&A chatbot using Databricks retrieval augmented generation (RAG) and serverless capabilities, leveraging the DBRX Instruct Foundation Model for smart responses against Databricks documentation (ingested as PDF). If you want to learn more about RAG and how to build it on the Databricks Data Intelligence Platform, you can refer to dbdemos.
You provide a source Delta table that contains pre-calculated embeddings. There is no automatic syncing when the Delta table is updated. You must manually update the index using the REST API when the embeddings table changes.
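As a sketch of what that manual update looks like, the Vector Search REST API exposes per-index data-modification endpoints (an `upsert-data` pattern) that accept the rows, including the pre-computed embedding, serialized as JSON. The index name, column names, and embedding values below are placeholders, and the exact endpoint path and field names should be checked against the Vector Search REST API reference:

```python
import json

# Hypothetical index and rows: `chunk_id` is the index's primary key, and
# the embeddings come from your own self-managed embedding model.
index_name = "main.rag.docs_direct_index"  # Unity Catalog three-level name
rows = [
    {"chunk_id": "doc1-0", "text": "Vector Search overview ...", "embedding": [0.12, -0.08, 0.33]},
    {"chunk_id": "doc1-1", "text": "Creating an index ...", "embedding": [0.05, 0.41, -0.17]},
]

# The upsert endpoint expects the rows serialized as a JSON string.
payload = {"inputs_json": json.dumps(rows)}
url = f"/api/2.0/vector-search/indexes/{index_name}/upsert-data"
```

A corresponding delete endpoint removes rows by primary key, which is what the refresh pipeline described below relies on.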
The following diagram illustrates the process:
The diagram at the end of this section shows the highlights of the ETL pipeline, which is triggered every time a new file is dropped into the Unity Catalog Volume. Databricks Auto Loader ingests the new data into the landing table, and the latest records are then merged into the staging table. The staging table data is chunked and converted into embeddings (using the BGE Large (English) Foundation Model on Databricks) before being inserted into the vector index. If an existing document is updated, all chunks of that document are first deleted from the vector index, and all the newly arrived records are then inserted. This ensures that the data from an existing document is completely replaced by the newly arrived version of the same document.
The notebooks available in this git repository illustrate the ETL process described in this section.
Retrieval Augmented Generation (RAG) is a generative AI design pattern that involves combining a Large Language Model (LLM) with external knowledge retrieval. RAG is required to connect real-time data to your generative AI applications. Doing so improves the accuracy and quality of the application, by providing your data as context to the LLM at inference time.
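Concretely, "providing your data as context at inference time" reduces to retrieving the most relevant chunks for a question and prepending them to the prompt sent to the LLM. A minimal sketch of that assembly step (the prompt template here is an illustrative choice, not the exact one used in the notebooks):

```python
def build_rag_prompt(question, retrieved_chunks):
    """Assemble an LLM prompt that grounds the answer in retrieved context.

    retrieved_chunks: top-k chunk texts returned by the vector index,
    ordered by similarity to the question's embedding.
    """
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(retrieved_chunks))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_rag_prompt(
    "What is Databricks Vector Search?",
    ["Databricks Vector Search is a vector database built into the platform.",
     "A Direct Vector Access Index supports direct read and write of vectors."],
)
```

The resulting string is what gets sent to the foundation model endpoint, so the model answers from your documents rather than only from its training data.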
In the previous section we designed a mechanism to keep the vector index fresh in near real time. Let us use it to create a Q&A chatbot using RAG and Databricks serverless capabilities, leveraging the DBRX Instruct Foundation Model for smart responses against Databricks documentation.
Here are the high-level steps for building the Q&A chatbot application using RAG:
The architecture of this solution is shown in the following diagram. If you want the code to build this yourself, see the git repository with notebooks that walk through all of these steps. Descriptions of the high-level steps used in the notebooks are also provided in the Appendix section below.
In this blog we saw how to build an ETL pipeline that keeps data fresh in a Databricks Direct Vector Access Index, and how to quickly build a Q&A chatbot using Databricks retrieval augmented generation (RAG) and serverless capabilities, leveraging the DBRX Instruct Foundation Model for smart responses against Databricks documentation.
Use Databricks Vector Index and Foundation Model APIs to build your own Q&A chatbot employing Retrieval Augmented Generation (RAG) application architecture.
High-level steps for each notebook are given below. The notebooks are available in this git repository.
01-Direct Vector Access Index
This notebook sets up:
02-Deploy-RAG-Chatbot-Model
This notebook: