2 weeks ago
I’ve been experimenting with the Databricks AI functions and recently ran a small test extracting structured information from a PDF document.
My initial approach was to use ai_parse_document to extract the text from the PDF.
While the function appeared to work at first glance, I noticed some transcription inaccuracies when validating the output. For example, a name and an identifier in the document were returned incorrectly.
These were small errors, but when extracting names or other identifiers, even minor inaccuracies make the results unreliable for downstream processing.
To test a different approach, I switched to PyMuPDF to extract the text from the PDF. This produced clean and accurate text from the document.
Once I had reliable text, I used ai_query to extract the fields I was interested in. I wrote a prompt asking the model to extract four specific fields from the document text and return the results in JSON format.
This part worked very well and produced consistent results.
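As a rough sketch of that extraction step, assuming the model replies with JSON text (the field names below are placeholders, not the actual fields from my document):

```python
import json

FIELDS = ["name", "identifier", "date", "amount"]  # placeholder field names

def build_prompt(document_text):
    """Ask the model for exactly these fields, as a single JSON object."""
    return (
        "Extract the following fields from the document text and "
        "respond with a single JSON object and nothing else.\n"
        "Fields: " + ", ".join(FIELDS) + "\n\n" + document_text
    )

def parse_reply(reply):
    """Parse the model's JSON reply and keep only the requested fields."""
    data = json.loads(reply)
    return {f: data.get(f) for f in FIELDS}
```

Constraining the output to the requested keys is what kept the results consistent across runs; anything extra the model volunteers is dropped.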
The workflow that ended up being reliable looked like this:

PDF → PyMuPDF (text extraction) → ai_query (structured field extraction) → JSON
In my testing, ai_query performed well once the document text was accurate. However, ai_parse_document was not reliable enough for precise extraction of names and identifiers.
I’m curious if others have experimented with ai_parse_document and have found ways to improve accuracy when extracting text from PDFs for structured data workflows.
2 weeks ago
Hi @FreshBrewedData,
In my experience, the difference comes from how the two approaches work internally.
ai_parse_document relies on a multimodal AI model to interpret the document and reconstruct its structure (paragraphs, tables, layout elements, etc.). Because it uses a generative model, small transcription inaccuracies can occasionally appear in the extracted text, which can be problematic when working with precise identifiers like names or IDs.
Libraries like PyMuPDF, on the other hand, perform deterministic parsing of the PDF text layer. If the document already contains embedded text (not just scanned images), the extracted content is usually identical to the original and therefore more reliable for downstream processing.
However, this approach only works when the PDF contains a real text layer. If the document is scanned and essentially stored as images, PyMuPDF will not be able to extract the text, while ai_parse_document can still work because it performs OCR and visual document understanding.
For this reason, a hybrid workflow often works well:
PDF → PyMuPDF (text extraction when text layer exists) → ai_query (structured field extraction) → JSON
In cases where the document is scanned or layout-heavy, using ai_parse_document directly can be more effective.
Curious to hear if others have tested similar pipelines or compared this with ai_parse_document on scanned PDFs or complex layouts.
2 weeks ago
I didn’t know ai_parse_document works better with scanned images. Thanks for sharing!
2 weeks ago
It also works as OCR, which is something PyMuPDF can't do. So if you have a scanned document, ai_parse_document is the only solution. Of course, it's an LLM, so it can make mistakes.
Thursday
Great observations — this is a pattern several of us have run into. The short answer is: your PyMuPDF + ai_query workflow is the right approach for digitally-born PDFs, and here's why.
ai_parse_document uses an OCR + Vision Language Model pipeline under the hood — it's rendering pages as images and using AI to "read" them. This is powerful for scanned documents, complex tables, and figures, but it introduces non-determinism. The model is essentially interpreting the text visually rather than extracting the actual embedded text data. For names and identifiers — where a single character error matters — this visual interpretation can produce subtle hallucinations (e.g., swapping similar-looking characters, misspelling proper nouns).
The function is also explicitly non-deterministic per the docs — running it twice on the same PDF may yield slightly different results.
For digitally-born PDFs (not scanned), the text is embedded as actual character data in the PDF. PyMuPDF extracts this data directly — no AI interpretation needed — so it's deterministic and lossless. Then passing that clean text to ai_query for structured field extraction gives the LLM accurate input to work with.
Your workflow is actually the recommended two-stage pattern, just with a more reliable first stage for your document type:
PDF → PyMuPDF (deterministic text) → ai_query (structured extraction) → JSON
| Scenario | Best approach |
| --- | --- |
| **Digital/native PDFs** (text-selectable) | PyMuPDF + ai_query: cheaper, faster, deterministic |
| Scanned documents / images | ai_parse_document + ai_query: OCR is necessary |
| Complex tables / layouts | ai_parse_document: extracts tables as HTML with structure |
| Figures / charts needing descriptions | ai_parse_document with `descriptionElementTypes` = `'*'` |
If your pipeline must handle both digital and scanned PDFs:
```sql
SELECT ai_parse_document(
  content,
  map(
    'imageOutputPath', '/Volumes/catalog/schema/volume/debug_images/',
    'descriptionElementTypes', '*'
  )
) FROM ...
```
```python
import pymupdf

def extract_text(pdf_bytes):
    """Try PyMuPDF first (digital PDFs), flag for ai_parse_document if empty."""
    doc = pymupdf.open(stream=pdf_bytes, filetype="pdf")
    text = "\n".join(page.get_text() for page in doc)
    return text.strip() if text.strip() else None

# In your pipeline:
# 1. Run PyMuPDF on all docs
# 2. For docs where extract_text returns None → use ai_parse_document
# 3. Pass all extracted text to ai_query for structured extraction
```
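The three commented steps can be glued together in a small dispatcher. The `ai_parse` and `ai_query_extract` callables below are hypothetical stand-ins for the Databricks functions, injected as parameters so the routing logic itself stays testable:

```python
def run_pipeline(docs, extract_text, ai_parse, ai_query_extract):
    """docs: {name: pdf_bytes}. Returns {name: structured fields}.

    Step 1: deterministic PyMuPDF extraction.
    Step 2: ai_parse_document fallback for scans (no text layer).
    Step 3: ai_query on whatever text we obtained.
    """
    results = {}
    for name, pdf_bytes in docs.items():
        text = extract_text(pdf_bytes)
        if text is None:
            text = ai_parse(pdf_bytes)  # OCR path for scanned docs
        results[name] = ai_query_extract(text)
    return results
```

In production the two AI calls would run as SQL over a Delta table rather than per-document Python calls, but the branching is the same.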
Your finding aligns with what others have observed — ai_parse_document is best suited for documents that require AI-based parsing (scans, complex layouts), while digitally-born PDFs are better served by direct text extraction libraries.