megha_upadhyay
Databricks Employee


 

Introduction

Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) have made it possible to have natural-language conversations with data, and for many use cases this is a genuine breakthrough. But as teams push document intelligence beyond simple Q&A into real business workflows, a gap emerges: documents are not just prose. They are full of numbers, dates, amounts, and counts that demand precision, not approximation. This post walks through that gap: why it exists, why RAG alone is not designed to close it, and how Databricks' integrated AI capabilities come together to deliver a document intelligence system that handles both the qualitative "what does this say?" and the quantitative "what do the numbers add up to?"

 

The Question That Broke Our RAG System

Imagine a retail operations team managing hundreds of stores across multiple grocery chains. Every month, thousands of invoices flow in, each containing item-level details, discounts, totals, payment terms, and return policies. Simple questions like "What is Store A’s return policy?" or "What discounts were applied on this invoice?" had no easy answer; teams were either manually digging through documents or waiting days for a data team to run a report.

To solve this, we built what we thought was a solid document intelligence system. Thousands of invoices from these chains were parsed, chunked, embedded, and indexed. A user could now ask the questions above and get an accurate answer, grounded in the actual document text, with source citations. Classic RAG, working exactly as designed.

Then someone asked: "Which store had the highest total sales in Q1?"

The system confidently returned a wrong answer. It had retrieved three invoice chunks that happened to mention large dollar amounts, pattern-matched its way to an answer, and served it up with full conviction. No SQL was run. No actual aggregation happened. The LLM simply “read some numbers in some chunks and guessed”.

At other times, the system declined to guess and simply reported that it did not have enough information. Either way, we did not get the response we wanted.

That moment reframed our entire approach. The problem was not that RAG was broken. The problem was that we were asking it to do something it was never built to do.

 

Documents Have Two Faces

Every business document, whether it is a contract, an invoice, a financial report, or a regulatory filing, carries two types of knowledge:

  • Qualitative knowledge lives in the words. It is the kind of information you read and understand, e.g. the return policy buried in paragraph four, the risk factor described in a footnote, the strategic rationale explained in an executive summary. This is the domain of meaning, context, and nuance.
  • Quantitative knowledge lives in the numbers. It is the kind of information you calculate, e.g. the contract value, the line-item total, the count of deliverables, the percentage change quarter over quarter. This is the domain of precision, aggregation, and arithmetic.

When we say that roughly 80-90% of enterprise data is unstructured, what we are really saying is that a massive volume of both qualitative and quantitative knowledge is locked inside documents that traditional analytics pipelines cannot touch. We wanted to build a document intelligence system to unlock both.

RAG is very well suited to unlocking the first.

 

Why RAG Excels at Qualitative Understanding

RAG's architecture is elegant in its simplicity. Parse documents into chunks, embed those chunks as vectors that capture semantic meaning, store them in a vector database, and at query time, retrieve the chunks most similar to the user's question. Hand those chunks to an LLM as context, and the model generates an answer grounded in your actual data rather than guesswork.


Figure 1: RAG Architecture
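
The retrieve-then-generate loop described above can be sketched in a few lines. This is a toy illustration, not the pipeline used in this post: the `embed` function is a bag-of-words stand-in for a real embedding model, and the chunk texts (including the 30-day policy) are made up for the example.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': word counts (assumption: stands in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

# Illustrative chunks only; not taken from real invoices.
chunks = [
    "Store A return policy: items may be returned within 30 days with a receipt.",
    "Payment terms: net 45 days from the invoice date.",
    "Invoice total: 1,250.00 USD including tax.",
]

top = retrieve("What is the return policy for Store A?", chunks, k=1)

# The retrieved chunks become the context handed to the LLM.
prompt = "Answer using only this context:\n" + "\n".join(top)
```

Note that nothing in this loop computes anything: it ranks text by similarity and passes the winners along.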

This works exceptionally well for qualitative questions, such as:

  • Summarise the key findings in the audit report.
  • What are the payment terms in the ABC contract?
  • Explain the risk factors mentioned in the Q3 filing.

RAG is the right tool for this job. The problem is the assumption that the same mechanism can handle an entirely different class of questions.

 

Where the Problem Arises

Consider these questions against a corpus of 500 invoices:

  • How many invoices are from Store A?

RAG retrieves, say, seven chunks that mention "Store A." But some invoices span multiple chunks, and some Store A-related chunks are never retrieved because their text focuses on payment terms rather than store identity, giving them low similarity to the query. The LLM looks at the retrieved chunks and says "approximately 7"; the actual answer may be 12.

  • What is the return policy for the store with the highest number of orders?

This is an interesting question as it requires two things in sequence: first, a precise count and comparison across structured data (which store has the most orders?), and then a qualitative lookup (what is that store's return policy?). RAG has no way to chain these steps. It cannot count orders, rank stores, identify the winner, and then pivot to a policy lookup. It is a single-pass retrieval system being asked to perform multi-step reasoning across two different knowledge types.

The core issue is architectural, not a bug to be fixed:

  • Chunking fragments the numerical context. An invoice total on page 2 gets separated from the line items on page 1. The embedding captures the semantics of each fragment independently, losing the arithmetic relationship between them.
  • Embedding similarity is not arithmetic. Vector search finds passages that are about similar topics. It does not and cannot sum, count, filter, or aggregate. "Most similar" and "mathematically correct" are fundamentally different operations.
  • LLMs reason over language, not over databases. When you hand an LLM seven text chunks containing numbers, it will do its best to weave them into a narrative answer. But it is generating language, not executing calculations. There is no SUM() or AVG() running under the hood.
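
A small simulation makes the counting failure concrete. This is a sketch under stated assumptions: the corpus is synthetic (12 Store A invoices, 8 Store B invoices, two chunks each), and `score` is a keyword-overlap stand-in for embedding similarity.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    invoice_id: int
    store: str
    text: str

def build_corpus() -> list[Chunk]:
    """Synthetic corpus: each invoice yields a header chunk and a payment-terms chunk."""
    corpus = []
    for invoice_id in range(1, 21):
        store = "Store A" if invoice_id <= 12 else "Store B"
        corpus.append(Chunk(invoice_id, store, f"Invoice {invoice_id} issued by {store}"))
        # This chunk belongs to the invoice but never mentions the store.
        corpus.append(Chunk(invoice_id, store, "Payment terms: net 30 days"))
    return corpus

def score(query: str, chunk: Chunk) -> int:
    """Stand-in for vector similarity: number of shared words."""
    return len(set(query.lower().split()) & set(chunk.text.lower().split()))

def top_k(query: str, corpus: list[Chunk], k: int) -> list[Chunk]:
    return sorted(corpus, key=lambda c: score(query, c), reverse=True)[:k]

corpus = build_corpus()
retrieved = top_k("How many invoices are from Store A", corpus, k=7)

# What the LLM can "count" is bounded by k, not by the data.
seen_by_llm = {c.invoice_id for c in retrieved if "Store A" in c.text}
true_count = len({c.invoice_id for c in corpus if c.store == "Store A"})
```

Here the retrieved context covers 7 invoices while the corpus contains 12. Raising `k` only moves the ceiling; it does not turn retrieval into counting.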

RAG is not failing here. It is being asked to be something it was never meant to be: a query engine over structured data. Recognising this distinction is the key to building systems that actually work.

 

The Design Question

Once we understood the problem clearly, the design question became obvious.

What if we could give every document two lives, one as searchable text for qualitative questions, and one as structured data for quantitative questions, and then have an intelligent router decide which path each question should take?

This is exactly what Databricks enables, not through a single product, but through four capabilities working together.

 

Parsing: Giving Documents Structure

The first step is to stop treating documents as flat text. Databricks' ai_parse_document function processes the raw binary of a document and extracts not just the text, but the “structure”: section headers, tables (as HTML), figure descriptions, and the spatial relationships between elements.

-- Parse each file's raw bytes into structured elements (text, tables, section headers)
SELECT
  ai_parse_document(content, map('version', '2.0')) AS parsed
FROM read_files('/path/to/documents', format => 'binaryFile');


Figure 2: ai_parse_document in action

Supported file types: PDF, JPG, JPEG, PNG, DOC, DOCX, PPT, PPTX.

From each invoice, we now extract two outputs:

  • The full text: concatenated, preserving reading order, ready for chunking and embedding.
  • Structured attributes: extracted via an LLM prompt (e.g., "Extract the store name, invoice date, total amount, and payment terms") and written as columns in a Delta Table.

This single parsing step creates both the qualitative and quantitative representations of every document. 
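
To illustrate the idea of two representations from one document, here is a minimal sketch. It assumes a toy invoice layout and uses regular expressions as a stand-in for the LLM extraction prompt; `InvoiceRow` and `extract_attributes` are hypothetical names, and the invoice text is invented.

```python
import re
from dataclasses import dataclass

@dataclass
class InvoiceRow:
    """One row of the structured (quantitative) table; column names are illustrative."""
    store_name: str
    invoice_date: str
    total_amount: float

def extract_attributes(full_text: str) -> InvoiceRow:
    """Stand-in for the LLM extraction step: regexes over a known toy layout."""
    store = re.search(r"Store:\s*(.+)", full_text).group(1).strip()
    date = re.search(r"Date:\s*(\d{4}-\d{2}-\d{2})", full_text).group(1)
    total = re.search(r"Total:\s*([\d,]+\.\d{2})", full_text).group(1)
    return InvoiceRow(store, date, float(total.replace(",", "")))

# Invented invoice text for the example.
full_text = """Store: Store A
Date: 2025-01-15
Total: 1,250.00
Returns accepted within 30 days with a receipt."""

row = extract_attributes(full_text)   # quantitative representation: typed columns
text_for_search = full_text           # qualitative representation: to be chunked and embedded
```

The typed `total_amount` can be summed reliably; the same number left inside a text chunk cannot.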

 

The Qualitative Path: Knowledge Assistant

The extracted text follows the familiar RAG pipeline. It is chunked using a recursive text splitter (512-token chunks with 64-token overlap), embedded using the databricks-gte-large-en model, and stored in a Databricks Vector Search index, a managed vector database that stays in sync with the underlying Delta Table.
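
The effect of the 512/64 setting can be seen with a simplified fixed-window chunker. This is not the recursive splitter itself (which also tries to break on natural separators); it is a sketch that treats list items as tokens, and `chunk_tokens` is a hypothetical helper name.

```python
def chunk_tokens(tokens: list[str], chunk_size: int = 512, overlap: int = 64) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens with the previous window."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window reaches the end, to avoid a trailing chunk that is pure overlap.
        if start + chunk_size >= len(tokens):
            break
    return chunks

tokens = [f"tok{i}" for i in range(1000)]
chunks = chunk_tokens(tokens)
```

With 1,000 tokens this yields three chunks, and the last 64 tokens of each chunk reappear at the start of the next, so a sentence that straddles a boundary is still seen whole by at least one chunk.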

On top of this index sits a Databricks Knowledge Assistant, built through Databricks Agent Bricks. It is a purpose-built retrieval agent: when a user asks a qualitative question, the Knowledge Assistant searches the vector index, retrieves the most relevant chunks, and generates a grounded, cited response.

E.g., "What's Store A’s return policy?"

The Knowledge Assistant retrieves the relevant invoice section and answers accurately, with a citation pointing back to the source document.

This path handles everything RAG was built for. It does not try to do more.

 

The Quantitative Path: Genie

The structured attributes, such as store name, invoice date, total amount, item count, payment terms, land in a Delta Table as clean, typed columns. This table is the quantitative twin of the same documents that feed the Knowledge Assistant.

Databricks Genie connects to this table and translates natural-language questions into precise SQL queries, leveraging Unity Catalog metadata and user-defined instructions to understand the domain vocabulary.

E.g., "How many invoices are from Store A?"

Genie translates this to:

SELECT COUNT(*) FROM <invoice_attributes> WHERE store_name = 'Store A';

The answer is a precise count executed against a structured table derived from the same documents.
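
To show what "executed against a structured table" means in practice, the sketch below runs the same kind of query locally. It uses Python's built-in `sqlite3` as a stand-in for the Delta Table and SQL warehouse; the table name `invoice_attributes`, its columns, and the rows are synthetic assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoice_attributes (invoice_id INTEGER, store_name TEXT, total_amount REAL)"
)

# Synthetic data: 12 invoices for Store A, 8 for Store B.
rows = [(i, "Store A" if i <= 12 else "Store B", 100.0 + i) for i in range(1, 21)]
conn.executemany("INSERT INTO invoice_attributes VALUES (?, ?, ?)", rows)

# Exact count: every matching row is considered, not just the top-k most similar.
(count,) = conn.execute(
    "SELECT COUNT(*) FROM invoice_attributes WHERE store_name = ?", ("Store A",)
).fetchone()

# Exact aggregation: the store with the highest total sales.
(top_store,) = conn.execute(
    "SELECT store_name FROM invoice_attributes "
    "GROUP BY store_name ORDER BY SUM(total_amount) DESC LIMIT 1"
).fetchone()
```

The database engine scans all rows and performs the arithmetic, which is exactly the step that retrieval plus generation lacks.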

 

The Orchestrator: Supervisor Agent

Having two separate paths is useful. Having an intelligent router that decides which path to take in real time is what makes this a unified system.

The Databricks Supervisor Agent, also built through Databricks Agent Bricks, sits in front of both the Knowledge Assistant and Genie. When a user asks a question, the Supervisor analyses the intent and routes accordingly:

- Qualitative or descriptive question → Knowledge Assistant (vector search path)

- Quantitative or analytical question → Genie (SQL execution path)

- Hybrid question → Both, as per the required sequence

That third category is where the real power lives.

E.g., "What is the return policy for the store with the highest number of orders?"

The Supervisor breaks this down:

  1. Route to Genie: "Which store has the highest number of orders?"
  2. Route to Knowledge Assistant: "What is its return policy?"

The user sees a single, seamless response. Behind the scenes, two fundamentally different systems collaborated, one doing arithmetic, the other doing comprehension, orchestrated by an agent that understood the question required both.
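
The decomposition above can be sketched as a small program. Everything here is a simplified stand-in: `classify` uses keyword cues where a real supervisor would analyse intent with an LLM, `genie_top_store` runs SQL on an in-memory `sqlite3` table, `knowledge_assistant` is a dictionary lookup instead of vector search, and the stores, orders, and policies are invented.

```python
import sqlite3

def classify(question: str) -> str:
    """Toy intent classifier (assumption: keyword cues instead of an LLM)."""
    q = question.lower()
    quantitative = any(cue in q for cue in ("how many", "highest", "total", "average"))
    qualitative = any(cue in q for cue in ("policy", "terms", "summarise", "explain"))
    if quantitative and qualitative:
        return "hybrid"
    return "quantitative" if quantitative else "qualitative"

def genie_top_store(conn: sqlite3.Connection) -> str:
    """Stand-in for the Genie step: an exact aggregation executed as SQL."""
    row = conn.execute(
        "SELECT store_name FROM orders GROUP BY store_name ORDER BY COUNT(*) DESC LIMIT 1"
    ).fetchone()
    return row[0]

def knowledge_assistant(store: str, policies: dict[str, str]) -> str:
    """Stand-in for the Knowledge Assistant: a lookup instead of retrieval."""
    return policies[store]

# Invented data for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, store_name TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "Store A"), (2, "Store B"), (3, "Store B"), (4, "Store B"), (5, "Store A")],
)
policies = {
    "Store A": "Returns within 30 days with a receipt.",
    "Store B": "Returns within 14 days, unopened items only.",
}

question = "What is the return policy for the store with the highest number of orders?"
route = classify(question)
answer = None
if route == "hybrid":
    store = genie_top_store(conn)                              # step 1: arithmetic
    answer = f"{store}: {knowledge_assistant(store, policies)}"  # step 2: comprehension
```

The key design point is the ordering: the output of the quantitative step (a store name) becomes the input to the qualitative step.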

No RAG system, however sophisticated, can do this alone. And it is not supposed to. This is a different architecture for a different class of problem.

 

Architecture at a Glance


Figure 3: A Unified Architecture

Everything flows from a single parsing step. The same document feeds both paths. The Supervisor decides, per question, in real time, which path to invoke. The user never needs to know which system answered. They just get the right answer.

 

Side by Side: What Changes

To make the difference concrete, here is how a sample question plays out across the two approaches:


Figure 4: A Fake Sample Invoice

 


Figure 5: RAG-only (left) vs. Databricks Supervisor (right)

"What is the return policy for the store with the highest number of orders?"  

  • RAG-only approach: cannot decompose the question into sub-tasks; it retrieves loosely related chunks and guesses.
  • Databricks Supervisor: Genie identifies the top store (Costgo), then the Knowledge Assistant retrieves Costgo’s return policy, giving a precise, grounded answer.

Following this architecture does not penalise qualitative questions; it extends the system's reach to questions RAG was never designed to answer.

 

Beyond Grocery Bills

We used invoices to tell this story because they make the problem tangible. But the pattern applies wherever documents contain both narrative and numbers, such as contracts, financial reports, and regulatory filings.

In every domain, the same architectural pattern holds: qualitative knowledge needs semantic search, quantitative knowledge needs structured query execution, and real-world questions often need both.

 

Conclusion

RAG transformed how we interact with unstructured data. It gave us the ability to ask questions in natural language and get answers grounded in our own documents. That was a genuine breakthrough, and it remains the right approach for qualitative understanding.

But documents are not purely qualitative. They are full of numbers, dates, amounts, and counts that demand precision, not approximation. Asking RAG to sum invoices is like asking a librarian to do your taxes. They are exceptionally good at finding the right book, but that is not the same skill as arithmetic.

Databricks' approach does not replace RAG. It completes it. By giving every document two lives, one as searchable text and the other as structured data, and by routing each question to the system best equipped to answer it, organisations can finally build document intelligence that is both contextually rich and numerically precise.

The librarian and the accountant, working together, with a manager who knows which one to call.

 
