Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) have made it possible to have natural-language conversations with data, and for many use cases, this is a genuine breakthrough. But as teams push document intelligence beyond simple Q&A into real business workflows, a gap emerges: documents are not just prose. They are full of numbers, dates, amounts, and counts that demand precision, not approximation. This post walks through that gap: why it exists, why RAG alone is not designed to close it, and how Databricks' integrated AI capabilities come together to deliver a document intelligence system that handles both the qualitative "what does this say?" and the quantitative "what do the numbers add up to?"
Imagine a retail operations team managing hundreds of stores across multiple grocery chains. Every month, thousands of invoices flow in, each containing item-level details, discounts, totals, payment terms, and return policies. Simple questions like "What is Store A’s return policy?" or "What discounts were applied on this invoice?" had no easy answer; teams were either manually digging through documents or waiting days for a data team to run a report.
To solve this, we built what we thought was a solid document intelligence system. Thousands of invoices from these chains were parsed, chunked, embedded, and indexed. A user could now ask the above questions and get a perfect answer, grounded in the actual document text, with source citations. Classic RAG, working exactly as designed.
Then someone asked: "Which store had the highest total sales in Q1?"
The system confidently returned a wrong answer. It had retrieved three invoice chunks that happened to mention large dollar amounts, pattern-matched its way to an answer, and served it up with full conviction. No SQL was run. No actual aggregation happened. The LLM simply read some numbers in some chunks and guessed.
At other times, the system returns no answer at all and simply reports that it does not have enough information. Either way, we did not get the response we wanted.
That moment reframed our entire approach. The problem was not that RAG was broken. The problem was that we were asking it to do something it was never built to do.
Every business document, whether it is a contract, an invoice, a financial report, or a regulatory filing, carries two types of knowledge:
When we say that roughly 80-90% of enterprise data is unstructured, what we are really saying is that a massive volume of both qualitative and quantitative knowledge is locked inside documents that traditional analytics pipelines cannot touch. We wanted to build a document intelligence system to unlock both.
RAG was designed brilliantly to unlock the first.
RAG's architecture is elegant in its simplicity. Parse documents into chunks, embed those chunks as vectors that capture semantic meaning, store them in a vector database, and at query time, retrieve the chunks most similar to the user's question. Hand those chunks to an LLM as context, and the model generates an answer grounded in your actual data rather than guesswork.
Figure 1: RAG Architecture
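To make the retrieve-then-generate loop concrete, here is a minimal sketch of the retrieval half. It is only an illustration: the bag-of-words `embed` function stands in for a real embedding model, the three chunks are made-up invoice text, and the function names are hypothetical rather than part of any Databricks API.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


# Made-up chunks for illustration only.
chunks = [
    "Store A return policy: items may be returned within 30 days with a receipt.",
    "Payment terms: net 45 days from the invoice date.",
    "Invoice total: 1,250.00 USD after a 5% discount.",
]

top = retrieve("What is the return policy for Store A?", chunks, k=1)
print(top[0])
```

The retrieved chunks would then be placed in the LLM prompt as context. Note that nothing in this loop counts, sums, or compares values across documents.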
This works exceptionally well for qualitative questions, such as:
RAG is the right tool for this job. The trouble starts when we assume the same mechanism can handle an entirely different class of questions.
Consider these questions against a corpus of 500 invoices:
RAG retrieves, say, seven chunks that mention "Store A." But some invoices span multiple chunks, and some Store A-related chunks were not retrieved because their text focused on payment terms rather than store identity, giving them low similarity to the query. The LLM looks at the retrieved chunks and says "approximately 7"; the actual answer may be 12.
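The arithmetic behind that miscount can be sketched in a few lines. This is a best-case illustration under stated assumptions: the corpus is synthetic, `TOP_K = 7` is an assumed retriever setting, and the retriever is treated as perfect (every returned chunk is a distinct Store A invoice). Even then, the LLM can never see more than `TOP_K` invoices.

```python
# Synthetic corpus: 500 invoices, 12 of which belong to Store A (illustrative numbers).
invoices = [
    {"invoice_id": i, "store_name": "Store A" if i < 12 else f"Store {i}"}
    for i in range(500)
]

TOP_K = 7  # assumed number of chunks handed to the LLM per question

# Best case for retrieval: every returned chunk is a distinct Store A invoice.
retrieved = [inv for inv in invoices if inv["store_name"] == "Store A"][:TOP_K]
count_visible_to_llm = len({inv["invoice_id"] for inv in retrieved})

# What an aggregation over the full table returns.
true_count = sum(1 for inv in invoices if inv["store_name"] == "Store A")

print(count_visible_to_llm, true_count)
```

Raising `TOP_K` only moves the ceiling; it does not turn similarity search into a count over the whole corpus.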
This is an interesting question as it requires two things in sequence: first, a precise count and comparison across structured data (which store has the most orders?), and then a qualitative lookup (what is that store's return policy?). RAG has no way to chain these steps. It cannot count orders, rank stores, identify the winner, and then pivot to a policy lookup. It is a single-pass retrieval system being asked to perform multi-step reasoning across two different knowledge types.
The core issue is architectural, not a bug to be fixed:
RAG is not failing here. It is being asked to be something it was never meant to be: a query engine over structured data. Recognising this distinction is the key to building systems that actually work.
Once we understood the problem clearly, the design question became obvious.
What if we could give every document two lives, one as searchable text for qualitative questions, and one as structured data for quantitative questions, and then have an intelligent router decide which path each question should take?
This is exactly what Databricks enables, not through a single product, but through four capabilities working together.
The first step is to stop treating documents as flat text. Databricks' ai_parse_document function processes the raw binary of a document and extracts not just the text but also its structure: section headers, tables (as HTML), figure descriptions, and the spatial relationships between elements.
SELECT
ai_parse_document(content, map('version', '2.0')) AS parsed
FROM read_files('/path/to/documents', format => 'binaryFile');
Figure 2: ai_parse_document in action
Supported file types: PDF, JPG, JPEG, PNG, DOC, DOCX, PPT, PPTX.
From each invoice, we now extracted two outputs:
- The full text - concatenated, preserving reading order, ready for chunking and embedding.
- Structured attributes, extracted via an LLM prompt (e.g., "Extract the store name, invoice date, total amount, and payment terms") and written as columns in a Delta Table.
This single parsing step creates both the qualitative and quantitative representations of every document.
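One way to picture the structured side is as a small validation step between the LLM's extraction output and the table. The sketch below assumes the extraction prompt returns JSON with four keys; the key names, the `InvoiceAttributes` class, and the sample values are all hypothetical, and writing to the Delta Table is left out.

```python
import json
from dataclasses import dataclass
from datetime import date
from decimal import Decimal


@dataclass(frozen=True)
class InvoiceAttributes:
    """Typed record mirroring the columns we want in the attributes table."""
    store_name: str
    invoice_date: date
    total_amount: Decimal
    payment_terms: str


def parse_attributes(llm_json: str) -> InvoiceAttributes:
    """Convert the LLM's JSON string into typed values; raises on missing or malformed fields."""
    raw = json.loads(llm_json)
    return InvoiceAttributes(
        store_name=str(raw["store_name"]).strip(),
        invoice_date=date.fromisoformat(raw["invoice_date"]),
        total_amount=Decimal(str(raw["total_amount"])),
        payment_terms=str(raw["payment_terms"]).strip(),
    )


# Made-up extraction output for illustration.
sample = (
    '{"store_name": "Store A", "invoice_date": "2024-02-14", '
    '"total_amount": "1250.00", "payment_terms": "Net 45"}'
)
row = parse_attributes(sample)
print(row)
```

Typing the values at this point is what later lets dates be filtered by quarter and amounts be summed, rather than compared as strings.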
The extracted text follows the familiar RAG pipeline. It is chunked using a recursive text splitter (512-token chunks with 64-token overlap), embedded using the databricks-gte-large-en model, and stored in a Databricks Vector Search index, a managed vector database that stays in sync with the underlying Delta Table.
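To show what the overlap buys us, here is a simplified chunker. It is not the recursive text splitter described above: it is a plain fixed-size sliding window over a pre-tokenised list, using the same 512/64 settings, and the function name is hypothetical.

```python
def chunk_tokens(tokens: list[str], size: int = 512, overlap: int = 64) -> list[list[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


# Synthetic 1,000-token document for illustration.
tokens = [f"t{i}" for i in range(1000)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
```

Because consecutive windows share 64 tokens, a sentence that straddles a chunk boundary still appears whole in at least one chunk, provided it is shorter than the overlap.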
On top of this index sits a Databricks Knowledge Assistant, built through Databricks Agent Bricks. It is a purpose-built retrieval agent: when a user asks a qualitative question, the Knowledge Assistant searches the vector index, retrieves the most relevant chunks, and generates a grounded, cited response.
E.g., "What's Store A’s return policy?"
The Knowledge Assistant retrieves the relevant invoice section and answers accurately, with a citation pointing back to the source document.
This path handles everything RAG was built for. It does not try to do more.
The structured attributes (store name, invoice date, total amount, item count, and payment terms) land in a Delta Table as clean, typed columns. This table is the quantitative twin of the same documents that feed the Knowledge Assistant.
Databricks Genie connects to this table and translates natural-language questions into precise SQL queries, leveraging Unity Catalog metadata and user-defined instructions to understand the domain vocabulary.
E.g., "How many invoices are from Store A?"
Genie translates this to:
SELECT COUNT(*) FROM <invoice_attributes> WHERE store_name = 'Store A';
The answer is a precise count executed against a structured table derived from the same documents.
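The same idea can be tried locally. The sketch below uses an in-memory SQLite table as a stand-in for the Delta Table; the rows are made up, the table and column names are assumptions, and "total sales" is assumed to mean the sum of invoice totals. It runs the count above and also the Q1 question that tripped up the RAG-only system.

```python
import sqlite3

# In-memory SQLite table standing in for the structured attributes table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoice_attributes (store_name TEXT, invoice_date TEXT, total_amount REAL)"
)
conn.executemany(
    "INSERT INTO invoice_attributes VALUES (?, ?, ?)",
    [
        ("Store A", "2024-01-15", 1200.0),
        ("Store A", "2024-02-20", 800.0),
        ("Store B", "2024-03-05", 2500.0),
        ("Store B", "2024-04-10", 9000.0),  # outside Q1, so it must not count
    ],
)

# "How many invoices are from Store A?"
invoice_count = conn.execute(
    "SELECT COUNT(*) FROM invoice_attributes WHERE store_name = ?", ("Store A",)
).fetchone()[0]

# "Which store had the highest total sales in Q1?"
top_store, q1_total = conn.execute(
    """
    SELECT store_name, SUM(total_amount) AS q1_sales
    FROM invoice_attributes
    WHERE invoice_date >= '2024-01-01' AND invoice_date < '2024-04-01'
    GROUP BY store_name
    ORDER BY q1_sales DESC
    LIMIT 1
    """
).fetchone()

print(invoice_count, top_store, q1_total)
```

Every row in the table participates in the aggregation, which is exactly the guarantee that top-k retrieval cannot give.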
Having two separate paths is useful. Having an intelligent router that decides which path to take in real time is what makes this a unified system.
The Databricks Supervisor Agent, also built through Databricks Agent Bricks, sits in front of both the Knowledge Assistant and Genie. When a user asks a question, the Supervisor analyses the intent and routes accordingly:
- Qualitative or descriptive question → Knowledge Assistant (vector search path)
- Quantitative or analytical question → Genie (SQL execution path)
- Hybrid question → Both, as per the required sequence
That third category is where the real power lives.
E.g., "What is the return policy for the store with the highest number of orders?"
The Supervisor breaks this down:
The user sees a single, seamless response. Behind the scenes, two fundamentally different systems collaborated, one doing arithmetic, the other doing comprehension, orchestrated by an agent that understood the question required both.
No RAG system, however sophisticated, can do this alone. And it is not supposed to. This is a different architecture for a different class of problem.
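The two-step chain can be sketched as follows. This is an illustration under several assumptions: each invoice is treated as one order, an in-memory SQLite table stands in for the structured path, a plain dictionary stands in for vector retrieval, and all names and policy texts are made up.

```python
import sqlite3

# Structured path stand-in: one row per invoice (assumed to equal one order).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoice_attributes (invoice_id INTEGER, store_name TEXT)")
conn.executemany(
    "INSERT INTO invoice_attributes VALUES (?, ?)",
    [(1, "Store A"), (2, "Store B"), (3, "Store B"), (4, "Store B"), (5, "Store A")],
)

# Qualitative path stand-in: a real system would search a vector index instead.
policy_chunks = {
    "Store A": "Returns accepted within 30 days with a receipt.",
    "Store B": "Returns accepted within 14 days; perishables excluded.",
}


def top_store_by_orders(db: sqlite3.Connection) -> str:
    """Step 1 (quantitative): count orders per store and pick the largest."""
    row = db.execute(
        "SELECT store_name FROM invoice_attributes "
        "GROUP BY store_name ORDER BY COUNT(*) DESC, store_name LIMIT 1"
    ).fetchone()
    return row[0]


def lookup_return_policy(store: str) -> str:
    """Step 2 (qualitative): fetch the policy text for the store found in step 1."""
    return policy_chunks[store]


def answer_hybrid(db: sqlite3.Connection) -> str:
    store = top_store_by_orders(db)
    policy = lookup_return_policy(store)
    return f"{store} has the most orders. Its return policy: {policy}"


print(answer_hybrid(conn))
```

The key point is the data dependency: the second step cannot even be phrased until the first step has produced a store name.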
Figure 3: A Unified Architecture
Everything flows from a single parsing step. The same document feeds both paths. The Supervisor decides, per question, in real time, which path to invoke. The user never needs to know which system answered. They just get the right answer.
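As a rough mental model of the per-question decision, consider the toy router below. It is deliberately naive: a keyword heuristic written for illustration, not how the Supervisor Agent actually classifies intent. The `Route` enum, the keyword lists, and the `route` function are all hypothetical.

```python
import re
from enum import Enum


class Route(Enum):
    KNOWLEDGE_ASSISTANT = "knowledge_assistant"  # vector search path
    GENIE = "genie"                              # SQL execution path
    BOTH = "both"                                # hybrid: chain the two paths


# Hypothetical cue words; a production router would rely on an LLM, not regexes.
QUANT = re.compile(r"\b(how many|count|total|sum|average|highest|lowest|most|least)\b", re.I)
QUAL = re.compile(r"\b(policy|policies|terms|clause|describe|explain|what does)\b", re.I)


def route(question: str) -> Route:
    """Pick a path based on which kinds of cue words appear in the question."""
    quant = bool(QUANT.search(question))
    qual = bool(QUAL.search(question))
    if quant and qual:
        return Route.BOTH
    if quant:
        return Route.GENIE
    return Route.KNOWLEDGE_ASSISTANT  # default to retrieval for descriptive questions


for q in [
    "What is Store A's return policy?",
    "How many invoices are from Store A?",
    "What is the return policy for the store with the highest number of orders?",
]:
    print(q, "->", route(q).value)
```

A heuristic like this breaks quickly on real phrasing, which is why the routing decision is delegated to an agent that reasons about intent rather than matching keywords.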
To make the difference concrete, here is how a sample question plays out across the two approaches:
Figure 4: A Fake Sample Invoice
Figure 5: RAG-Only (Left) vs. Databricks Supervisor (Right)
"What is the return policy for the store with the highest number of orders?"
RAG-only approach: Cannot decompose the question into sub-tasks; retrieves loosely related chunks and guesses.
Databricks Supervisor: Genie identifies the top store (Costgo), then the Knowledge Assistant retrieves Costgo’s return policy. The answer is precise and grounded.
Following this architecture does not penalise qualitative questions; it extends the system's reach to questions RAG was never designed to answer.
We used invoices to tell this story because they make the problem tangible. But the pattern applies wherever documents contain both narrative and numbers, such as contracts, financial reports, and regulatory filings.
In every domain, the same architectural pattern holds: qualitative knowledge needs semantic search, quantitative knowledge needs structured query execution, and real-world questions often need both.
RAG transformed how we interact with unstructured data. It gave us the ability to ask questions in natural language and get answers grounded in our own documents. That was a genuine breakthrough, and it remains the right approach for qualitative understanding.
But documents are not purely qualitative. They are full of numbers, dates, amounts, and counts that demand precision, not approximation. Asking RAG to sum invoices is like asking a librarian to do your taxes. They are exceptionally good at finding the right book, but that is not the same skill as arithmetic.
Databricks' approach does not replace RAG. It completes it. By giving every document two lives, one as searchable text and the other as structured data, and by routing each question to the system best equipped to answer it, organisations can finally build document intelligence that is both contextually rich and numerically precise.
The librarian and the accountant, working together, with a manager who knows which one to call.