Hi @FreshBrewedData,
In my experience, the difference comes down to how the two approaches work internally.
ai_parse_document relies on a multimodal AI model to interpret the document and reconstruct its structure (paragraphs, tables, layout elements, etc.). Because it uses a generative model, small transcription inaccuracies can occasionally appear in the extracted text, which can be problematic when working with precise identifiers like names or IDs.
Libraries like PyMuPDF, on the other hand, perform deterministic parsing of the PDF text layer. If the document already contains embedded text (not just scanned images), the extracted content is usually identical to the original and therefore more reliable for downstream processing.
However, this approach only works when the PDF contains a real text layer. If the document is scanned and essentially stored as images, PyMuPDF will not be able to extract the text, while ai_parse_document can still work because it performs OCR and visual document understanding.
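To make the distinction concrete, here is a minimal sketch of the PyMuPDF side, assuming the `pymupdf` package (historically imported as `fitz`). The `has_text_layer` threshold is an illustrative heuristic I made up for this example, not a PyMuPDF feature:

```python
def has_text_layer(page_texts: list[str], min_chars: int = 25) -> bool:
    """Heuristic: treat the PDF as scanned if no page yields real text.

    The min_chars threshold is arbitrary; tune it for your documents.
    """
    return any(len(text.strip()) >= min_chars for text in page_texts)


def extract_pages(pdf_path: str) -> list[str]:
    """Deterministically read the embedded text layer, page by page."""
    import pymupdf  # pip install pymupdf

    with pymupdf.open(pdf_path) as doc:
        return [page.get_text() for page in doc]
```

For a scanned PDF, `extract_pages` returns mostly empty strings, which is exactly the signal you can use to fall back to ai_parse_document.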
For this reason, a hybrid workflow often works well:
PDF → PyMuPDF (text extraction when text layer exists) → ai_query (structured field extraction) → JSON
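The routing logic of that pipeline can be sketched in a few lines. Here `run_ai_parse_document` and `run_ai_query` are hypothetical callables standing in for the actual Databricks calls; only the routing decision itself is concrete:

```python
import json


def process_pdf(page_texts, run_ai_parse_document, run_ai_query, min_chars=25):
    """PDF pages -> raw text (deterministic or OCR path) -> structured JSON.

    run_ai_parse_document / run_ai_query are placeholders for the real
    Databricks AI functions, injected here so the routing is testable.
    """
    if any(len(t.strip()) >= min_chars for t in page_texts):
        # Text layer exists: keep the deterministic PyMuPDF output as-is.
        raw_text = "\n".join(page_texts)
    else:
        # Scanned / image-only: fall back to the multimodal OCR path.
        raw_text = run_ai_parse_document()
    fields = run_ai_query(raw_text)  # structured field extraction
    return json.dumps(fields)
```

The point of the split is that precise identifiers (names, IDs) flow through the deterministic branch untouched whenever possible, and the generative model is only trusted where there is no alternative.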
In cases where the document is scanned or layout-heavy, using ai_parse_document directly can be more effective.
Curious to hear if others have tested similar pipelines or compared this with ai_parse_document on scanned PDFs or complex layouts.