Daniel-Liden
Databricks Employee

Building a Research Paper Curator for Knowledge Assistants

Document-heavy workflows such as research analysis, contract review, and support ticket processing are a natural fit for AI agents. Building them from scratch means orchestrating document parsing, information extraction, vector search, chat interfaces, and likely much more. That’s a lot of plumbing before you get to the interesting part.

Databricks Agent Bricks and the ai_parse_document SQL function eliminate most of that plumbing. In this post, we’ll show how these tools work together by building an application for curating research papers for a knowledge assistant. Using this example, we’ll highlight patterns you can apply to your own agentic applications on Databricks.

What we’ll use:

  • Agent Bricks Information Extraction (KIE) to pull structured fields from each paper
  • Agent Bricks Knowledge Assistant for RAG-based Q&A over the curated papers
  • The ai_parse_document SQL function to convert PDFs into structured text
  • Databricks Apps for the curation UI
  • The arXiv API as the source of candidate papers

This is a high-level structural overview, so we won’t walk through every line of code. Instead, we’ll focus on how these components interact and unlock value together. For the full implementation, check out the project on GitHub.

The Problem: Knowledge Assistants Need Curation

Knowledge assistants such as the Databricks Agent Bricks Knowledge Assistant work best when they have access to focused, relevant content. Overloading them with irrelevant material can cause them to retrieve only part of the information needed to address a query or, worse, to retrieve entirely incorrect information (e.g., 1, 2).

Manually pre-screening materials for inclusion in a knowledge assistant can be a cumbersome and time-consuming task, especially when it comes to technical documents like research papers. You may need to search through dozens of pages of text to determine whether a paper warrants inclusion in your knowledge assistant.

The Solution: A Curation Workflow

To make the screening process as painless as possible, we use several Databricks AI features to pull key details out of each research paper, plus a Databricks App that provides a simple UI for flagging papers to include.

[Image: The curator app: search arXiv, parse and extract with AI, manage your knowledge base]

To accomplish this, we will:

  1. Search the arXiv API for papers based on topics and keywords
  2. Get the text from promising candidates with the ai_parse_document SQL function
  3. Extract key details from the candidate papers with Agent Bricks Information Extraction
  4. Review the extracted fields, which target the criteria we’re using to decide whether a paper belongs in our Knowledge Assistant. This is much quicker than reviewing the full papers manually!
  5. Add relevant papers to the knowledge assistant.

[Image: Document search, parsing, and review workflow for knowledge assistant curation]

Setting Up the Agents

Agent Bricks agents are tailored to specific tasks and data, so we need to initialize the Information Extraction and Knowledge Assistant agents before we can integrate them into our application.

After the agents are configured (via the Databricks UI), we will invoke them via the OpenAI SDK. Agent Bricks endpoints are OpenAI-compatible, so you can use the familiar OpenAI SDK pattern:

from databricks.sdk import WorkspaceClient

ws_client = WorkspaceClient()
openai_client = ws_client.serving_endpoints.get_open_ai_client()

response = openai_client.chat.completions.create(
    model="your-agent-endpoint",
    messages=[{"role": "user", "content": "Your prompt here"}],
)

This pattern works for both KIE and Knowledge Assistant endpoints. The WorkspaceClient handles authentication automatically.

Here’s how to set up the Agent Bricks agents.

Key Information Extraction (KIE)

The KIE agent extracts structured fields from parsed documents, giving you a quick preview of each paper’s contributions, methodology, and limitations without reading 30 pages. You configure what to extract by defining a JSON schema, then point the agent at your parsed text. The full KIE documentation covers all configuration options.

Preparing the data. Before KIE can extract anything, PDFs need to be converted to text. We started with a golden set of seminal LLM agent papers (ReAct, Reflexion, etc.) and parsed them using the ai_parse_document SQL function:

SELECT ai_parse_document(content) AS parsed
FROM read_files('/Volumes/catalog/schema/volume/paper.pdf', format => 'binaryFile')

The ai_parse_document function handles multi-column layouts, equations, and citations, returning structured JSON with page-by-page content. We stored the extracted text in a parsed_documents table, which becomes the KIE agent’s data source.
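
In the curator app, this parsing step runs from application code rather than a notebook. Here’s a minimal sketch using the Databricks SDK’s statement execution API; the warehouse ID and the staging volume path are placeholders, and the catalog and schema names are assumptions for this example:

from databricks.sdk import WorkspaceClient

ws_client = WorkspaceClient()

# Parse every PDF in the staging volume and store the results in the
# parsed_documents table. Volume path, table name, and warehouse ID are
# assumptions for this simplified, one-shot sketch.
parse_sql = """
CREATE TABLE IF NOT EXISTS arxiv_demo.main.parsed_documents AS
SELECT path, ai_parse_document(content) AS parsed
FROM read_files('/Volumes/arxiv_demo/main/staging', format => 'binaryFile')
"""

resp = ws_client.statement_execution.execute_statement(
    statement=parse_sql,
    warehouse_id="<your-sql-warehouse-id>",
    wait_timeout="50s",
)
print(resp.status.state)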

Defining the schema. The Agent Bricks UI generates a starter schema based on your data, but you’ll want to replace it with fields relevant to your use case. Switch to JSON schema mode and define what you need. For paper screening, we defined seven fields: title, authors, affiliation, methodology, contributions, limitations, and topics. Here’s an excerpt:

{
  "properties": {
    ...
    "methodology": {
      "type": "string",
      "description": "Research methods, model architectures, and experimental approaches used."
    },
    "limitations": {
      "type": "array",
      "items": {"type": "string"},
      "description": "Acknowledged weaknesses, constraints, and areas for improvement."
    },
    ...
  }
}

The description fields matter: they guide the model’s extraction logic. If you’re getting inconsistent results, refining the description often helps more than adding examples. We also used anyOf with null for optional fields (like affiliation) so the model returns null when information isn’t present rather than hallucinating.
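
For example, an optional field such as affiliation can be declared so that null is explicitly allowed (the description text here is illustrative):

"affiliation": {
  "anyOf": [
    {"type": "string"},
    {"type": "null"}
  ],
  "description": "Primary institutional affiliation of the authors, if stated."
}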

[Image: KIE agent configuration showing extraction schema and sample outputs]
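
Once the KIE agent is deployed, the app calls it with the same OpenAI-compatible pattern shown earlier. A minimal sketch, assuming an endpoint named kie-arxiv-papers (yours will differ) and a parsed_text string pulled from the parsed_documents table:

import json

from databricks.sdk import WorkspaceClient

openai_client = WorkspaceClient().serving_endpoints.get_open_ai_client()

# parsed_text would come from the parsed_documents table; truncated here.
parsed_text = "ReAct: Synergizing Reasoning and Acting in Language Models ..."

response = openai_client.chat.completions.create(
    model="kie-arxiv-papers",  # assumed KIE endpoint name
    messages=[{"role": "user", "content": parsed_text}],
)

# The endpoint returns the extracted fields as JSON matching the schema above.
fields = json.loads(response.choices[0].message.content)
print(fields["contributions"], fields["limitations"])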

Knowledge Assistant

The Knowledge Assistant provides RAG-based Q&A over your documents. Setup is straightforward: point it at a Unity Catalog Volume and deploy. The full Knowledge Assistant documentation covers additional options like vector search indexes and custom instructions.

Configuring the data source. In the Agent Bricks UI, select Knowledge Assistant and click Build. You’ll need to specify:

  • Name: We used arxiv-papers
  • Knowledge Source: Select Unity Catalog Volume and point it at your PDFs volume (e.g., arxiv_demo.main.pdfs)
  • Instructions (optional): Guidelines for how the agent should respond

Only documents in this volume will be indexed. For the curator application, we set up a separate staging volume for papers under review; documents only move to the KA volume when explicitly approved. This keeps the knowledge base focused.

After deployment, the agent takes a few minutes to sync and index your documents. You can test it immediately via the embedded chat or AI Playground. If you add or remove files from the volume later, click Sync in the UI to update the index.

[Image: Knowledge Assistant configured with a UC Volume source]

The Curation Workflow

We built a Databricks App with a four-phase workflow that calls upon these agents:

Phase 1: Search

Search arXiv for papers matching your criteria. The UI lets you filter by category, date range, and keywords. You can select papers that look like they might be relevant to your interests and worth including in the knowledge assistant.
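
Under the hood, this phase is a call to the public arXiv API. A minimal sketch of one way to do it, with illustrative query terms (the app builds the query from the UI filters):

import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Search recent cs.CL papers mentioning "agents"; the terms are illustrative.
params = urllib.parse.urlencode({
    "search_query": "cat:cs.CL AND all:agents",
    "start": 0,
    "max_results": 10,
    "sortBy": "submittedDate",
    "sortOrder": "descending",
})

with urllib.request.urlopen(f"http://export.arxiv.org/api/query?{params}") as resp:
    feed = ET.fromstring(resp.read())

for entry in feed.findall(f"{ATOM}entry"):
    title = entry.findtext(f"{ATOM}title").strip()
    pdf_url = entry.findtext(f"{ATOM}id").replace("/abs/", "/pdf/")
    print(title, pdf_url)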

[Image: Search interface with category filters and date range]

Phase 2: Review (Parse + Extract)

Once you select some papers, you can click the “Parse Papers” button beneath the list. This triggers the following:

  1. The selected papers are downloaded and uploaded to a staging volume. Papers in the staging volume are not accessible by the Knowledge Assistant.
  2. The ai_parse_document SQL function extracts the text of each paper and saves it to a Unity Catalog table.
  3. The Information Extraction agent we defined previously extracts the relevant fields from the papers.

Once the parsing and extraction are complete (which might take a few minutes, depending on the number of papers), you can review the key contributions, methods, and limitations of each paper in the review tab. From the review tab, you can select which papers to add to the knowledge assistant.
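
For reference, step 1 above boils down to downloading the PDF and writing it to the staging volume with the Files API. A minimal sketch, with an assumed staging volume path:

import io
import urllib.request

from databricks.sdk import WorkspaceClient

ws_client = WorkspaceClient()

pdf_url = "https://arxiv.org/pdf/2210.03629"  # ReAct, one of the example papers
staging_path = "/Volumes/arxiv_demo/main/staging/2210.03629.pdf"  # assumed path

with urllib.request.urlopen(pdf_url) as resp:
    pdf_bytes = resp.read()

# Write into the staging volume, which the Knowledge Assistant does not index.
ws_client.files.upload(staging_path, io.BytesIO(pdf_bytes), overwrite=True)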

[Image: KIE-extracted insights showing key contributions and methodology]

Phase 3: Manage

The KA Manager shows what’s currently in your knowledge base. You can add or remove documents. Keeping the knowledge assistant’s contents up to date helps keep its answers targeted and relevant over time.
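
Approving a paper amounts to moving it from the staging volume into the volume the Knowledge Assistant indexes. A minimal sketch with the Files API, using assumed volume paths; after moving files, a sync (as described above) refreshes the index:

from databricks.sdk import WorkspaceClient

ws_client = WorkspaceClient()

src = "/Volumes/arxiv_demo/main/staging/2210.03629.pdf"  # assumed staging path
dst = "/Volumes/arxiv_demo/main/pdfs/2210.03629.pdf"     # KA source volume

# Copy the approved paper into the Knowledge Assistant's volume, then clean up staging.
downloaded = ws_client.files.download(src)
ws_client.files.upload(dst, downloaded.contents, overwrite=True)
ws_client.files.delete(src)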

[Image: Knowledge Assistant Manager showing curated papers]

Phase 4: Chat

Query your curated knowledge base. Because you’ve been selective about what goes in, responses are focused and relevant.
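
In the app, this is another call to an OpenAI-compatible endpoint, this time the Knowledge Assistant’s. A minimal sketch, assuming the endpoint is named arxiv-papers as configured above:

from databricks.sdk import WorkspaceClient

openai_client = WorkspaceClient().serving_endpoints.get_open_ai_client()

response = openai_client.chat.completions.create(
    model="arxiv-papers",  # assumed Knowledge Assistant endpoint name
    messages=[{"role": "user", "content": "How does ReAct combine reasoning and acting?"}],
)
print(response.choices[0].message.content)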

[Image: Chat interface querying the curated knowledge base]

There are several other ways to chat with the knowledge assistant. You can use the Knowledge Assistant UI directly in Databricks by opening the Agents tab, selecting your knowledge assistant, and using its built-in chat interface. You can also use the Databricks AI Playground: open the Playground tab and choose your Knowledge Assistant endpoint from the model selection dropdown. We included a chat interface in the curator application primarily for convenience.

Key Takeaway

We used a paper curator as our example, but the real takeaway is how quickly you can build document-processing agents on Databricks. Agent Bricks + ai_parse_document handle the hard parts, so you can focus on your use case.

  • Knowledge Assistant gives you a Q&A bot over your documents in just a few UI-driven steps, without needing to worry about configuring a vector database or chat interface yourself.
  • The Information Extraction agent can obtain key details from complex documents and return them in a structure you define. Again, this takes only a few UI-driven steps.
  • ai_parse_document makes it easy to extract clean, structured text from documents and excels at complicated documents like research papers that may include tables and figures.
  • Databricks Apps let you define custom application logic for tying all the pieces together.

In this example, we developed a curator app that simplifies the process of adding relevant data to a knowledge assistant. But the broader principles apply across use cases. With Databricks Agent Bricks, creating performant, high-quality agents that integrate seamlessly with your applications and workflows has never been easier, and they’re even more powerful when used together.

Try It Yourself

Check out the project on GitHub. To get started, import the repository to your Databricks workspace and follow the steps in the project’s runbook, which will guide you through setting up the agents and deploying the application.
