You want your RAG solution (built from the Databricks RAG Cookbook) to show PDF files opened at specific pages as references in the Review App, rather than plain text chunks. You also want to retrieve the source documents from your serving endpoint, but your current code returns only the model's answer, not the sources.
Here is a step-by-step review with actionable improvements to meet these requirements:
1. Opening PDFs at Specific Pages
The standard ReviewApp in Databricks' RAG cookbook is text-focused, and does not support direct PDF page referencing out of the box. To display PDFs and open them at a specified page, you must:
- Use a PDF viewer component (for example, Chainlit's Pdf element) or another frontend framework.
- When a reference is clicked, pass both the file path (or URL) and the page number to the viewer.
- Store a mapping (e.g., pdf_path, page_no) for each chunk during retrieval so that references in the UI link directly to the PDF at the page you want.
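One way to picture the mapping is a small record per chunk plus a helper that turns it into a link. The sketch below is only an illustration: the PdfReference class and page_link function are hypothetical names, and it assumes the viewer honors a #page=N URL fragment, which most browser PDF viewers do but which you should verify for yours.

```python
from dataclasses import dataclass
from urllib.parse import quote


@dataclass(frozen=True)
class PdfReference:
    """Where a retrieved chunk came from (hypothetical structure)."""
    pdf_path: str  # path or URL of the source PDF
    page_no: int   # 1-based page number of the chunk


def page_link(ref: PdfReference) -> str:
    """Build a link that asks the PDF viewer to open at ref.page_no."""
    if ref.page_no < 1:
        raise ValueError("page_no must be 1 or greater")
    # Percent-encode spaces and similar characters, keep '/' and ':' intact.
    return f"{quote(ref.pdf_path, safe='/:')}#page={ref.page_no}"


ref = PdfReference(pdf_path="/files/my doc.pdf", page_no=11)
print(page_link(ref))
```

Paths that already carry a query string would need different encoding rules; this sketch only covers plain paths and simple URLs.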
Databricks ReviewApp Modification
- Not natively supported. It would require custom frontend code to integrate a PDF viewer and drive page navigation from the reference metadata.
- It is easier and more modular to build this directly with a framework such as Chainlit or Streamlit (with PyPDF2, pdfplumber, or frontend PDF components).
2. Do You Need to Build a Chainlit App?
Yes, building with Chainlit (or similar) is recommended for:
- Embedding a PDF viewer.
- Navigating to a specific page programmatically.
- Making the app interactive and reference-friendly.
With Chainlit, you can use the Pdf element to show a PDF and set the page it opens at from your reference metadata.
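A minimal sketch of wiring one source reference into Chainlit follows. It assumes the element class is cl.Pdf and that it accepts name, display, path, and page keyword arguments; check this against your installed Chainlit version. The pdf_element_kwargs helper is a hypothetical name, and the Chainlit calls are left as comments so the snippet runs without Chainlit installed.

```python
from pathlib import Path


def pdf_element_kwargs(source: dict) -> dict:
    """Translate one source entry into keyword arguments for a PDF element."""
    return {
        "name": Path(source["file"]).name,  # label shown in the UI
        "display": "side",                  # open next to the chat
        "path": source["file"],             # local path of the PDF
        "page": int(source["page"]),        # page to open at
    }


# Inside a Chainlit message handler (assumes `import chainlit as cl`):
#     elements = [cl.Pdf(**pdf_element_kwargs(src)) for src in sources]
#     await cl.Message(content=answer, elements=elements).send()

print(pdf_element_kwargs({"file": "docs/mydoc.pdf", "page": 11}))
```

Keeping the translation in a plain function makes it easy to unit-test separately from the UI.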
3. Calling the Serving Endpoint to Return Source Documents
Your model serving endpoint (/invocations) needs to return not just the answer, but also the source document metadata (file path, page, chunk) for referencing.
- If the model endpoint doesn’t return sources, check:
  - Is your retrieval chain constructed to include sources? In a RAG chain, the output should include a field such as sources or references.
  - Is your serving endpoint returning source_documents in the JSON?
Typical structure for returning sources:

    {
      "output": "LLM answer...",
      "source_documents": [
        {
          "file": "mydoc.pdf",
          "page": 11,
          "chunk": "The LLM is..."
        }
      ]
    }

Your current code only prints the top-level dictionary items; you may need to check for ['source_documents'] in the response.
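Depending on how the model was logged, the sources may not sit at the top level of the response. The sketch below searches a nested payload for the first source_documents list. The wrapping under a predictions key is only an assumed example shape, and find_source_documents is a hypothetical helper name.

```python
from typing import Any


def find_source_documents(payload: Any) -> list:
    """Return the first 'source_documents' list found in a nested payload."""
    if isinstance(payload, dict):
        sources = payload.get("source_documents")
        if isinstance(sources, list):
            return sources
        children = list(payload.values())
    elif isinstance(payload, list):
        children = payload
    else:
        return []
    # Depth-first search through nested dicts and lists.
    for child in children:
        found = find_source_documents(child)
        if found:
            return found
    return []


# Assumed example: the result wrapped under a 'predictions' key.
wrapped = {
    "predictions": [
        {"output": "LLM answer...",
         "source_documents": [{"file": "mydoc.pdf", "page": 11}]}
    ]
}
print(find_source_documents(wrapped))
```

If this returns an empty list for a real response, the endpoint is most likely not emitting sources at all and the fix belongs in the chain, not in the client.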
How to Get Source Documents
- Back-end changes: In your Databricks RAG chain code, make sure the chain is set up to return references:
  - For LangChain: set return_source_documents=True when constructing the chain (this applies to the legacy chain classes such as RetrievalQA); the retrieved documents then appear under source_documents in the chain output.
  - For custom solutions: append metadata (file, page) to the returned list.
- Serving endpoint: It must be configured to return the sources as part of its response.
- Frontend handling: Parse source_documents in the returned JSON and use it to display or link to the correct PDF and page.
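For the custom-solution case, the back end can package the answer and the per-chunk metadata into the structure shown earlier. This is a sketch under assumptions: build_response is a hypothetical helper, and the retrieved chunks are assumed to be dicts with a text field and a metadata dict holding file and page, which may not match what your retriever returns.

```python
import json


def build_response(answer: str, retrieved_chunks: list) -> dict:
    """Package the answer with one source entry per retrieved chunk."""
    sources = []
    for chunk in retrieved_chunks:
        metadata = chunk.get("metadata", {})
        sources.append({
            "file": metadata.get("file"),   # None if the retriever lost it
            "page": metadata.get("page"),
            "chunk": chunk.get("text", ""),
        })
    return {"output": answer, "source_documents": sources}


# Assumed chunk shape, for illustration only.
chunks = [
    {"text": "The LLM is...", "metadata": {"file": "mydoc.pdf", "page": 11}},
]
print(json.dumps(build_response("LLM answer...", chunks), indent=2))
```

Missing metadata shows up as null values in the JSON, which makes it easy to spot chunks that were indexed without file or page information.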
4. Practical Code Improvements
Model Request
Ensure your backend is returning the sources:

    import requests

    def score_model(dataset):
        ...
        # Call endpoint as before
        response = requests.request(...)
        if response.status_code != 200:
            ...
        # Ideally: response.json() contains 'output' and 'source_documents'
        return response.json()

Example Response Handling

    result = score_model(...)
    print("Answer:", result.get("output"))
    # Each source entry carries the file and page needed to open the PDF.
    for src in result.get("source_documents", []):
        print(f"File: {src['file']} Page: {src['page']}")
        # In frontend: pass src['file'] and src['page'] to the PDF viewer

5. Recommendation Table
| Option | PDF Page Support | Effort | Flexibility | Source Metadata Required |
|---|---|---|---|---|
| ReviewApp | No (custom hack) | Medium | Low | Yes |
| Chainlit App | Yes (Pdf element) | Low/Med | High | Yes |
References
- [How to display PDFs and open at a page with Chainlit PdfViewer]
- [Databricks RAG Cookbook documentation]
Summary:
Change your app to use Chainlit (or extend ReviewApp with a PDF viewer library) for easy PDF page referencing. Make sure your backend returns source metadata, and update the frontend code to pass the file and page to the viewer.