Document-heavy workflows such as research analysis, contract review, and support ticket processing are a natural fit for AI agents. Building them from scratch means orchestrating document parsing, information extraction, vector search, chat interfaces, and likely much more. That’s a lot of plumbing before you get to the interesting part.
Databricks Agent Bricks and the ai_parse_document SQL function eliminate most of that plumbing. In this post, we’ll show how these tools work together by building an application for curating research papers for a knowledge assistant. Using this example, we’ll highlight patterns you can apply to your own agentic applications on Databricks.
What we’ll use: Agent Bricks Information Extraction (KIE) and Knowledge Assistant agents, the ai_parse_document SQL function, and a Databricks App for the curation UI.
This is a high-level structural overview, so we won’t walk through every line of code. Instead, we’ll focus on how these components interact and unlock value together. For the full implementation, check out the project on GitHub.
Knowledge assistants such as the Databricks Agent Bricks Knowledge Assistant work best when they have access to focused, relevant content. Overloading them with irrelevant material can cause them to retrieve only part of the information they need to answer a query or, worse, to retrieve entirely incorrect information (e.g., 1, 2).
Manually pre-screening materials for inclusion in a knowledge assistant can be a cumbersome and time-consuming task, especially when it comes to technical documents like research papers. You may need to search through dozens of pages of text to determine whether a paper warrants inclusion in your knowledge assistant.
To make the screening process as painless as possible, we can use several AI features on Databricks to pull key details out of each research paper, plus a Databricks App to provide a simple UI for flagging papers to include.
The curator app: search arXiv, parse and extract with AI, manage your knowledge base
To accomplish this, we will configure the two Agent Bricks agents, parse papers with ai_parse_document, and build a Databricks App that ties everything together.
Document search, parsing, and review workflow for knowledge assistant curation
Agent Bricks agents are tailored to specific tasks and data, so we need to initialize the Information Extraction and Knowledge Assistant agents before we can integrate them into our application.
Once the agents are configured in the Databricks UI, we invoke them with the OpenAI SDK. Agent Bricks endpoints are OpenAI-compatible, so you can use the familiar pattern:
from databricks.sdk import WorkspaceClient

# WorkspaceClient picks up credentials automatically (e.g., inside a Databricks App or notebook).
ws_client = WorkspaceClient()

# Agent Bricks serving endpoints are OpenAI-compatible, so we can use a standard OpenAI client.
openai_client = ws_client.serving_endpoints.get_open_ai_client()

response = openai_client.chat.completions.create(
    model="your-agent-endpoint",  # name of your Agent Bricks serving endpoint
    messages=[{"role": "user", "content": "Your prompt here"}],
)
This pattern works for both KIE and Knowledge Assistant endpoints. The WorkspaceClient handles authentication automatically.
Here’s how to set up the Agent Bricks agents.
The KIE agent extracts structured fields from parsed documents, giving you a quick preview of each paper’s contributions, methodology, and limitations without reading 30 pages. You configure what to extract by defining a JSON schema, then point the agent at your parsed text. The full KIE documentation covers all configuration options.
Preparing the data. Before KIE can extract anything, PDFs need to be converted to text. We started with a golden set of seminal LLM agent papers (ReAct, Reflexion, etc.) and parsed them using the ai_parse_document SQL function:
SELECT ai_parse_document(content) AS parsed
FROM read_files('/Volumes/catalog/schema/volume/paper.pdf', format => 'binaryFile')
The ai_parse_document function handles multi-column layouts, equations, and citations, returning structured JSON with page-by-page content. We stored the extracted text in a parsed_documents table, which becomes the KIE agent’s data source.
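As a rough sketch of that step (catalog, schema, and volume names are placeholders, and we assume a Databricks environment where Spark is available), the parsed output can be written straight into that table:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already defined in a Databricks notebook

# Parse every PDF in the volume and persist the results; names are placeholders.
spark.sql("""
    CREATE OR REPLACE TABLE catalog.schema.parsed_documents AS
    SELECT
        path,
        ai_parse_document(content) AS parsed
    FROM read_files('/Volumes/catalog/schema/volume/', format => 'binaryFile')
""")

The parsed column holds the full structured output, which you can flatten into plain text downstream before handing it to the KIE agent.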
Defining the schema. The Agent Bricks UI generates a starter schema based on your data, but you’ll want to replace it with fields relevant to your use case. Switch to JSON schema mode and define what you need. For paper screening, we defined seven fields: title, authors, affiliation, methodology, contributions, limitations, and topics. Here’s an excerpt:
{
  "properties": {
    ...
    "methodology": {
      "type": "string",
      "description": "Research methods, model architectures, and experimental approaches used."
    },
    "limitations": {
      "type": "array",
      "items": {"type": "string"},
      "description": "Acknowledged weaknesses, constraints, and areas for improvement."
    },
    ...
  }
}
The description fields matter: they guide the model’s extraction logic. If you’re getting inconsistent results, refining the description often helps more than adding examples. We also used anyOf with null for optional fields (like affiliation) so the model returns null when information isn’t present rather than hallucinating.
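For illustration, an optional field like affiliation could be declared as follows (this fragment is illustrative, not the exact schema from the project):

"affiliation": {
  "anyOf": [
    {"type": "string"},
    {"type": "null"}
  ],
  "description": "Primary author affiliation, or null if not stated in the paper."
}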
KIE agent configuration showing extraction schema and sample outputs
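Putting the pieces together, here’s a minimal sketch of invoking the deployed KIE endpoint on a parsed paper. The endpoint name, table, and column are placeholders, and we assume the endpoint returns the extracted fields as JSON in the assistant message:

import json

from databricks.sdk import WorkspaceClient
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
openai_client = WorkspaceClient().serving_endpoints.get_open_ai_client()

# Grab the parsed text for one paper; table and column names are placeholders.
row = spark.table("catalog.schema.parsed_documents").limit(1).collect()[0]

response = openai_client.chat.completions.create(
    model="kie-paper-extraction",  # your KIE endpoint name
    messages=[{"role": "user", "content": str(row["parsed"])}],
)

# The extracted fields come back as JSON matching the schema defined above.
fields = json.loads(response.choices[0].message.content)
print(fields["title"], fields["limitations"])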
The Knowledge Assistant provides RAG-based Q&A over your documents. Setup is straightforward: point it at a Unity Catalog Volume and deploy. The full Knowledge Assistant documentation covers additional options like vector search indexes and custom instructions.
Configuring the data source. In the Agent Bricks UI, select Knowledge Assistant and click Build, give the agent a name, and point it at the Unity Catalog Volume that holds your documents. Only documents in this volume will be indexed. For the curator application, we set up a separate staging volume for papers under review; documents move to the KA volume only when explicitly approved, which keeps the knowledge base focused.
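As a sketch of that approval step (volume paths are placeholders; the Files API calls come from the Databricks SDK):

import io

from databricks.sdk import WorkspaceClient

ws = WorkspaceClient()
STAGING = "/Volumes/catalog/schema/staging_papers"
KA_VOLUME = "/Volumes/catalog/schema/knowledge_assistant_papers"

def approve_paper(filename: str) -> None:
    """Copy an approved paper from the staging volume into the KA volume."""
    pdf_bytes = ws.files.download(f"{STAGING}/{filename}").contents.read()
    ws.files.upload(f"{KA_VOLUME}/{filename}", io.BytesIO(pdf_bytes), overwrite=True)
    ws.files.delete(f"{STAGING}/{filename}")  # optional: clear it from staging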
After deployment, the agent takes a few minutes to sync and index your documents. You can test it immediately via the embedded chat or AI Playground. If you add or remove files from the volume later, click Sync in the UI to update the index.
Knowledge Assistant configured with a UC Volume source
We built a Databricks App with a four-phase workflow that calls upon these agents:
Search arXiv for papers matching your criteria. The UI lets you filter by category, date range, and keywords, and you can select the papers that look relevant and worth including in the knowledge assistant.
Search interface with category filters and date range
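Under the hood, the search can be as simple as a call to the arXiv API. Here’s one way to do it with the open-source arxiv Python package (the query and category are illustrative; the app’s own implementation may differ):

import arxiv

client = arxiv.Client()
search = arxiv.Search(
    query='cat:cs.AI AND all:"LLM agents"',  # category + keyword filter
    max_results=10,
    sort_by=arxiv.SortCriterion.SubmittedDate,
)

# Each result carries the metadata shown in the UI plus a direct PDF link.
for result in client.results(search):
    print(result.published.date(), result.title, result.pdf_url)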
Once you select some papers, you can click the “Parse Papers” button beneath the list. This downloads the selected PDFs to the staging volume, converts them to text with ai_parse_document, and passes the parsed text to the KIE agent for field extraction.
Once parsing and extraction are complete (which might take a few minutes, depending on the number of papers), you can review each paper’s key contributions, methods, and limitations in the Review tab and select which papers to add to the knowledge assistant.
KIE-extracted insights showing key contributions and methodology
The KA Manager shows what’s currently in your knowledge base and lets you add or remove documents. Keeping it limited to relevant material keeps the assistant’s answers targeted over time.
Knowledge Assistant Manager showing curated papers
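A minimal sketch of the underlying operations, again using the Databricks SDK Files API against a placeholder KA volume path:

from databricks.sdk import WorkspaceClient

ws = WorkspaceClient()
KA_VOLUME = "/Volumes/catalog/schema/knowledge_assistant_papers"

# List what's currently backing the knowledge assistant.
for entry in ws.files.list_directory_contents(KA_VOLUME):
    print(entry.path)

# Remove a paper that no longer belongs (filename is hypothetical).
ws.files.delete(f"{KA_VOLUME}/outdated_paper.pdf")

After changing the volume, click Sync in the Agent Bricks UI so the index picks up the change.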
Query your curated knowledge base. Because you’ve been selective about what goes in, responses are focused and relevant.
Chat interface querying the curated knowledge base
There are other ways to chat with the knowledge assistant. You can use its chat interface directly in Databricks by opening the Agents tab and selecting your knowledge assistant, or use the AI Playground by opening the Playground tab and picking your Knowledge Assistant endpoint from the model selection dropdown. We embedded a chat interface in the curator application primarily for convenience.
We used a paper curator as our example, but the real takeaway is how quickly you can build document-processing agents on Databricks. Agent Bricks + ai_parse_document handle the hard parts, so you can focus on your use case.
In this example, we built a curator app that simplifies adding relevant data to a knowledge assistant, but the broader principles apply across use cases. With Databricks Agent Bricks, creating performant, high-quality agents that integrate seamlessly with your applications and workflows has never been easier, and the agents are even more powerful when used together.
Check out the project on GitHub. To get started, import the repository to your Databricks workspace and follow the steps in the project’s runbook, which will guide you through setting up the agents and deploying the application.