Generative AI
Explore discussions on generative artificial intelligence techniques and applications within the Databricks Community. Share ideas, challenges, and breakthroughs in this cutting-edge field.

Testing ai_parse_document vs PyMuPDF for PDF extraction

FreshBrewedData
New Contributor II

I’ve been experimenting with the Databricks AI functions and recently ran a small test extracting structured information from a PDF document.

My initial approach was to use ai_parse_document to extract the text from the PDF.

While the function appeared to work at first glance, I noticed some transcription inaccuracies when validating the output. For example, a name and an identifier in the document were returned incorrectly.

These were small errors, but when extracting names or other identifiers, even minor inaccuracies make the results unreliable for downstream processing.

To test a different approach, I switched to PyMuPDF to extract the text from the PDF. This produced clean and accurate text from the document.

Once I had reliable text, I used ai_query to extract the fields I was interested in. I wrote a prompt asking the model to extract four specific fields from the document text and return the results in JSON format.

This part worked very well and produced consistent results.

The workflow that ended up being reliable looked like this:

 
PDF
→ PyMuPDF (text extraction)
→ ai_query (field extraction)
→ JSON output
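To guard the last step, it helps to check the model's JSON before downstream use. A minimal stdlib sketch (the field names and the raw response below are made-up placeholders, not the actual document's fields):

```python
import json

# Hypothetical raw response from ai_query; real field names depend on your prompt.
raw_response = '{"name": "Jane Doe", "employee_id": "E-10423", "date": "2024-05-01", "amount": "1250.00"}'

EXPECTED_FIELDS = {"name", "employee_id", "date", "amount"}

def parse_extraction(raw: str) -> dict:
    """Parse the model's JSON output and fail loudly if any expected field is missing."""
    record = json.loads(raw)
    missing = EXPECTED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"Model output missing fields: {missing}")
    return record

record = parse_extraction(raw_response)
```

Failing early here is cheaper than letting a malformed record propagate into downstream tables.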
 

In my testing, ai_query performed well once the document text was accurate. However, ai_parse_document was not reliable enough for precise extraction of names and identifiers.

I’m curious if others have experimented with ai_parse_document and have found ways to improve accuracy when extracting text from PDFs for structured data workflows.

 
 

Ale_Armillotta
Contributor III

Hi @FreshBrewedData,

 

In my experience the difference comes from how the two approaches work internally.

ai_parse_document relies on a multimodal AI model to interpret the document and reconstruct its structure (paragraphs, tables, layout elements, etc.). Because it uses a generative model, small transcription inaccuracies can occasionally appear in the extracted text, which can be problematic when working with precise identifiers like names or IDs.

 

Libraries like PyMuPDF, on the other hand, perform deterministic parsing of the PDF text layer. If the document already contains embedded text (not just scanned images), the extracted content is usually identical to the original and therefore more reliable for downstream processing.

However, this approach only works when the PDF contains a real text layer. If the document is scanned and essentially stored as images, PyMuPDF will not be able to extract the text, while ai_parse_document can still work because it performs OCR and visual document understanding.

For this reason, a hybrid workflow often works well:

PDF → PyMuPDF (text extraction when text layer exists) → ai_query (structured field extraction) → JSON
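One simple way to pick the branch is to treat an empty or near-empty PyMuPDF result as a sign that the PDF is scanned. This is only a heuristic, and the character threshold below is an arbitrary assumption:

```python
from typing import Optional

def needs_ocr(extracted_text: Optional[str], min_chars: int = 20) -> bool:
    """Return True when the extracted text suggests a scanned PDF (no real text layer)."""
    if extracted_text is None:
        return True
    return len(extracted_text.strip()) < min_chars

# Route each document: needs_ocr(...) -> ai_parse_document, otherwise keep the PyMuPDF text.
```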

In cases where the document is scanned or layout-heavy, using ai_parse_document directly can be more effective.

 

Curious to hear if others have tested similar pipelines or compared this with ai_parse_document on scanned PDFs or complex layouts.

I didn’t know ai_parse_document works better with scanned images. Thanks for sharing!

It also works as OCR, which is something PyMuPDF can't do. So if you have a scanned document, ai_parse_document is the only option. Of course, it's an LLM, so it can make mistakes.

anuj_lathi
Databricks Employee

Great observations — this is a pattern several of us have run into. The short answer is: your PyMuPDF + ai_query workflow is the right approach for digitally-born PDFs, and here's why.

Why ai_parse_document can get names/identifiers wrong

ai_parse_document uses an OCR + Vision Language Model pipeline under the hood — it's rendering pages as images and using AI to "read" them. This is powerful for scanned documents, complex tables, and figures, but it introduces non-determinism. The model is essentially interpreting the text visually rather than extracting the actual embedded text data. For names and identifiers — where a single character error matters — this visual interpretation can produce subtle hallucinations (e.g., swapping similar-looking characters, misspelling proper nouns).

The function is also explicitly non-deterministic per the docs — running it twice on the same PDF may yield slightly different results.
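One way to soften that non-determinism (an idea, not an official feature) is to run the extraction a few times and keep the majority value per field. A minimal stdlib sketch, with made-up field values:

```python
from collections import Counter

def majority_vote(runs):
    """Given several dicts of extracted fields, keep the most common value per field."""
    fields = set().union(*(r.keys() for r in runs))
    return {f: Counter(r.get(f) for r in runs).most_common(1)[0][0] for f in fields}

runs = [
    {"name": "Jane Doe", "id": "E-10423"},
    {"name": "Jane Doe", "id": "E-10428"},  # one run mis-read a digit in the ID
    {"name": "Jane Doe", "id": "E-10423"},
]
```

This trades extra inference cost for stability, so it only makes sense for small volumes of high-value fields.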

Why PyMuPDF + ai_query works better here

For digitally-born PDFs (not scanned), the text is embedded as actual character data in the PDF. PyMuPDF extracts this data directly — no AI interpretation needed — so it's deterministic and lossless. Then passing that clean text to ai_query for structured field extraction gives the LLM accurate input to work with.

Your workflow is actually the recommended two-stage pattern, just with a more reliable first stage for your document type:

PDF → PyMuPDF (deterministic text) → ai_query (structured extraction) → JSON

 

When to use which

 

| Scenario | Best approach |
| --- | --- |
| **Digital/native PDFs** (text-selectable) | PyMuPDF + ai_query — cheaper, faster, deterministic |
| Scanned documents / images | ai_parse_document + ai_query — OCR is necessary |
| Complex tables / layouts | ai_parse_document — extracts tables as HTML with structure |
| Figures / charts needing descriptions | ai_parse_document with `descriptionElementTypes` = `'*'` |

Tips if you do need ai_parse_document

If your pipeline must handle both digital and scanned PDFs:

  1. Enable `imageOutputPath` to visually inspect what the model received — this helps diagnose whether the issue is input quality or model interpretation:

```sql
SELECT ai_parse_document(
  content,
  map(
    'imageOutputPath', '/Volumes/catalog/schema/volume/debug_images/',
    'descriptionElementTypes', '*'
  )
) FROM ...
```

 

  2. Add a validation step — since there are no confidence scores, implement your own (regex for expected ID formats, cross-reference against known values).
  3. Consider a hybrid approach — try PyMuPDF first; if it returns empty/garbled text (indicating a scanned PDF), fall back to ai_parse_document.
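For the validation step, a regex check against the expected identifier format catches most single-character OCR slips. The ID format below is a made-up example; substitute your real pattern:

```python
import re

# Hypothetical format: one uppercase letter, a dash, five digits (e.g. "E-10423").
ID_PATTERN = re.compile(r"^[A-Z]-\d{5}$")

def is_valid_id(value: str) -> bool:
    """Reject extracted identifiers that don't match the expected format."""
    return bool(ID_PATTERN.fullmatch(value))
```

A classic OCR slip this catches is the letter "O" substituted for the digit "0", which string comparison alone won't flag as suspicious.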

Example: Hybrid pipeline in PySpark

```python
import pymupdf

def extract_text(pdf_bytes):
    """Try PyMuPDF first (digital PDFs), flag for ai_parse_document if empty."""
    doc = pymupdf.open(stream=pdf_bytes, filetype="pdf")
    text = "\n".join(page.get_text() for page in doc)
    return text.strip() if text.strip() else None

# In your pipeline:
# 1. Run PyMuPDF on all docs
# 2. For docs where extract_text returns None → use ai_parse_document
# 3. Pass all extracted text to ai_query for structured extraction
```


Your finding aligns with what others have observed — ai_parse_document is best suited for documents that require AI-based parsing (scans, complex layouts), while digitally-born PDFs are better served by direct text extraction libraries.

Anuj Lathi
Solutions Engineer @ Databricks