<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Testing ai_parse_document vs PyMuPDF for PDF extraction in Generative AI</title>
    <link>https://community.databricks.com/t5/generative-ai/testing-ai-parse-document-vs-pymupdf-for-pdf-extraction/m-p/151095#M1708</link>
    <description>&lt;P&gt;I didn’t know ai_parse_document works better with scanned images. Thanks for sharing!&lt;/P&gt;</description>
    <pubDate>Tue, 17 Mar 2026 00:39:28 GMT</pubDate>
    <dc:creator>FreshBrewedData</dc:creator>
    <dc:date>2026-03-17T00:39:28Z</dc:date>
    <item>
      <title>Testing ai_parse_document vs PyMuPDF for PDF extraction</title>
      <link>https://community.databricks.com/t5/generative-ai/testing-ai-parse-document-vs-pymupdf-for-pdf-extraction/m-p/151056#M1705</link>
      <description>&lt;P&gt;I’ve been experimenting with the Databricks AI functions and recently ran a small test extracting structured information from a PDF document.&lt;/P&gt;&lt;P&gt;My initial approach was to use &lt;STRONG&gt;ai_parse_document&lt;/STRONG&gt; to extract the text from the PDF.&lt;/P&gt;&lt;P&gt;While the function appeared to work at first glance, I noticed some transcription inaccuracies when validating the output. For example, a name and an identifier in the document were returned incorrectly.&lt;/P&gt;&lt;P&gt;These were small errors, but when extracting names or other identifiers, even minor inaccuracies make the results unreliable for downstream processing.&lt;/P&gt;&lt;P&gt;To test a different approach, I switched to &lt;STRONG&gt;PyMuPDF&lt;/STRONG&gt; to extract the text from the PDF. This produced clean and accurate text from the document.&lt;/P&gt;&lt;P&gt;Once I had reliable text, I used &lt;STRONG&gt;ai_query&lt;/STRONG&gt; to extract the fields I was interested in. I wrote a prompt asking the model to extract four specific fields from the document text and return the results in JSON format.&lt;/P&gt;&lt;P&gt;This part worked very well and produced consistent results.&lt;/P&gt;&lt;P&gt;The workflow that ended up being reliable looked like this:&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;PDF&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;→ PyMuPDF (text extraction)&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;→ ai_query (field extraction)&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;→ JSON output&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;In my testing, &lt;STRONG&gt;ai_query&lt;/STRONG&gt; performed well once the document text was accurate. However, &lt;STRONG&gt;ai_parse_document&lt;/STRONG&gt; was not reliable enough for precise extraction of names and identifiers.&lt;/P&gt;&lt;P&gt;I’m curious if others have experimented with &lt;STRONG&gt;ai_parse_document&lt;/STRONG&gt; and have found ways to improve accuracy when extracting text from PDFs for structured data workflows.&lt;/P&gt;</description>
      <pubDate>Mon, 16 Mar 2026 14:43:24 GMT</pubDate>
      <guid>https://community.databricks.com/t5/generative-ai/testing-ai-parse-document-vs-pymupdf-for-pdf-extraction/m-p/151056#M1705</guid>
      <dc:creator>FreshBrewedData</dc:creator>
      <dc:date>2026-03-16T14:43:24Z</dc:date>
    </item>
    <item>
      <title>Re: Testing ai_parse_document vs PyMuPDF for PDF extraction</title>
      <link>https://community.databricks.com/t5/generative-ai/testing-ai-parse-document-vs-pymupdf-for-pdf-extraction/m-p/151093#M1707</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/215576"&gt;@FreshBrewedData&lt;/a&gt;&amp;nbsp;.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;In my experience the difference comes from how the two approaches work internally.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;&lt;STRONG&gt;ai_parse_document&lt;/STRONG&gt; relies on a multimodal AI model to interpret the document and reconstruct its structure (paragraphs, tables, layout elements, etc.). Because it uses a generative model, small transcription inaccuracies can occasionally appear in the extracted text, which can be problematic when working with precise identifiers like names or IDs.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Libraries like &lt;STRONG&gt;PyMuPDF&lt;/STRONG&gt;, on the other hand, perform deterministic parsing of the PDF text layer. If the document already contains embedded text (not just scanned images), the extracted content is usually identical to the original and therefore more reliable for downstream processing.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;However, this approach only works when the PDF contains a real text layer. 
If the document is scanned and essentially stored as images, PyMuPDF will not be able to extract the text, while ai_parse_document can still work because it performs OCR and visual document understanding.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;For this reason, a hybrid workflow often works well:&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;PDF → PyMuPDF (text extraction when text layer exists) → ai_query (structured field extraction) → JSON&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;In cases where the document is scanned or layout-heavy, using ai_parse_document directly can be more effective.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Curious to hear if others have tested similar pipelines or compared this with ai_parse_document on scanned PDFs or complex layouts.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 17 Mar 2026 00:07:44 GMT</pubDate>
      <guid>https://community.databricks.com/t5/generative-ai/testing-ai-parse-document-vs-pymupdf-for-pdf-extraction/m-p/151093#M1707</guid>
      <dc:creator>Ale_Armillotta</dc:creator>
      <dc:date>2026-03-17T00:07:44Z</dc:date>
    </item>
    <item>
      <title>Re: Testing ai_parse_document vs PyMuPDF for PDF extraction</title>
      <link>https://community.databricks.com/t5/generative-ai/testing-ai-parse-document-vs-pymupdf-for-pdf-extraction/m-p/151095#M1708</link>
      <description>&lt;P&gt;I didn’t know ai_parse_document works better with scanned images. Thanks for sharing!&lt;/P&gt;</description>
      <pubDate>Tue, 17 Mar 2026 00:39:28 GMT</pubDate>
      <guid>https://community.databricks.com/t5/generative-ai/testing-ai-parse-document-vs-pymupdf-for-pdf-extraction/m-p/151095#M1708</guid>
      <dc:creator>FreshBrewedData</dc:creator>
      <dc:date>2026-03-17T00:39:28Z</dc:date>
    </item>
    <item>
      <title>Re: Testing ai_parse_document vs PyMuPDF for PDF extraction</title>
      <link>https://community.databricks.com/t5/generative-ai/testing-ai-parse-document-vs-pymupdf-for-pdf-extraction/m-p/151113#M1709</link>
      <description>&lt;P&gt;It also works as OCR, which is something PyMuPDF can’t do. So if you have a scanned document, ai_parse_document is the only solution. Of course, it’s an LLM, so it can make mistakes.&lt;/P&gt;</description>
      <pubDate>Tue, 17 Mar 2026 07:57:03 GMT</pubDate>
      <guid>https://community.databricks.com/t5/generative-ai/testing-ai-parse-document-vs-pymupdf-for-pdf-extraction/m-p/151113#M1709</guid>
      <dc:creator>Ale_Armillotta</dc:creator>
      <dc:date>2026-03-17T07:57:03Z</dc:date>
    </item>
    <item>
      <title>Re: Testing ai_parse_document vs PyMuPDF for PDF extraction</title>
      <link>https://community.databricks.com/t5/generative-ai/testing-ai-parse-document-vs-pymupdf-for-pdf-extraction/m-p/152200#M1728</link>
      <description>&lt;P&gt;&lt;SPAN&gt;Great observations — this is a pattern several of us have run into. The short answer is: &lt;/SPAN&gt;&lt;STRONG&gt;your PyMuPDF + ai_query workflow is the right approach for digitally-born PDFs&lt;/STRONG&gt;&lt;SPAN&gt;, and here's why.&lt;/SPAN&gt;&lt;/P&gt;
&lt;H2&gt;&lt;SPAN&gt;Why ai_parse_document can get names/identifiers wrong&lt;/SPAN&gt;&lt;/H2&gt;
&lt;P&gt;&lt;SPAN&gt;ai_parse_document&lt;/SPAN&gt;&lt;SPAN&gt; uses an &lt;/SPAN&gt;&lt;STRONG&gt;OCR + Vision Language Model&lt;/STRONG&gt;&lt;SPAN&gt; pipeline under the hood — it's rendering pages as images and using AI to "read" them. This is powerful for scanned documents, complex tables, and figures, but it introduces non-determinism. The model is essentially &lt;/SPAN&gt;&lt;I&gt;&lt;SPAN&gt;interpreting&lt;/SPAN&gt;&lt;/I&gt;&lt;SPAN&gt; the text visually rather than extracting the actual embedded text data. For names and identifiers — where a single character error matters — this visual interpretation can produce subtle hallucinations (e.g., swapping similar-looking characters, misspelling proper nouns).&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;The function is also explicitly &lt;/SPAN&gt;&lt;STRONG&gt;non-deterministic&lt;/STRONG&gt;&lt;SPAN&gt; per the docs — running it twice on the same PDF may yield slightly different results.&lt;/SPAN&gt;&lt;/P&gt;
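&lt;P&gt;&lt;SPAN&gt;One way to see this in practice is to diff two extractions of the same document token by token. The sketch below is only an illustration: differing_tokens is a hypothetical helper, the two strings are made-up stand-ins for two runs, and it relies on Python's standard difflib module rather than anything Databricks-specific.&lt;/SPAN&gt;&lt;/P&gt;
&lt;PRE&gt;
```python
import difflib


def differing_tokens(run_a, run_b):
    """Return (tokens_from_a, tokens_from_b) pairs where two extractions disagree."""
    tokens_a = run_a.split()
    tokens_b = run_b.split()
    matcher = difflib.SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    diffs = []
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        # Anything that is not an exact match is worth a manual look.
        if tag != "equal":
            diffs.append((" ".join(tokens_a[a_start:a_end]),
                          " ".join(tokens_b[b_start:b_end])))
    return diffs


# Made-up example strings standing in for two runs on the same PDF.
first_run = "Invoice INV-20418 issued to Maria Oliveira"
second_run = "Invoice INV-2O418 issued to Maria Olivera"

for left, right in differing_tokens(first_run, second_run):
    print(f"{left!r} vs {right!r}")
```
&lt;/PRE&gt;
&lt;P&gt;&lt;SPAN&gt;The same comparison could be run between PyMuPDF output and ai_parse_document output on a digital PDF to spot where the two disagree.&lt;/SPAN&gt;&lt;/P&gt;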
&lt;H2&gt;&lt;SPAN&gt;Why PyMuPDF + ai_query works better here&lt;/SPAN&gt;&lt;/H2&gt;
&lt;P&gt;&lt;SPAN&gt;For &lt;/SPAN&gt;&lt;STRONG&gt;digitally-born PDFs&lt;/STRONG&gt;&lt;SPAN&gt; (not scanned), the text is embedded as actual character data in the PDF. PyMuPDF extracts this data directly — no AI interpretation needed — so it's &lt;/SPAN&gt;&lt;STRONG&gt;deterministic and lossless&lt;/STRONG&gt;&lt;SPAN&gt;. Then passing that clean text to &lt;/SPAN&gt;&lt;SPAN&gt;ai_query&lt;/SPAN&gt;&lt;SPAN&gt; for structured field extraction gives the LLM accurate input to work with.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Your workflow is actually the recommended two-stage pattern, just with a more reliable first stage for your document type:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;PDF → PyMuPDF (deterministic text) → ai_query (structured extraction) → JSON&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H2&gt;&lt;SPAN&gt;When to use which&lt;/SPAN&gt;&lt;/H2&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;TABLE&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD&gt;
&lt;P&gt;&lt;STRONG&gt;Scenario&lt;/STRONG&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;STRONG&gt;Best approach&lt;/STRONG&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;
&lt;P&gt;&lt;STRONG&gt;Digital/native PDFs&lt;/STRONG&gt;&lt;SPAN&gt; (text-selectable)&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;PyMuPDF + ai_query — cheaper, faster, deterministic&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;
&lt;P&gt;&lt;STRONG&gt;Scanned documents / images&lt;/STRONG&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;ai_parse_document + ai_query — OCR is necessary&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;
&lt;P&gt;&lt;STRONG&gt;Complex tables / layouts&lt;/STRONG&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;ai_parse_document — extracts tables as HTML with structure&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;
&lt;P&gt;&lt;STRONG&gt;Figures / charts needing descriptions&lt;/STRONG&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;ai_parse_document with descriptionElementTypes set to '*'&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;H2&gt;&lt;SPAN&gt;Tips if you do need ai_parse_document&lt;/SPAN&gt;&lt;/H2&gt;
&lt;P&gt;&lt;SPAN&gt;If your pipeline must handle both digital and scanned PDFs:&lt;/SPAN&gt;&lt;/P&gt;
&lt;OL&gt;
&lt;LI style="font-weight: 400;" aria-level="1"&gt;&lt;STRONG&gt;Enable imageOutputPath&lt;/STRONG&gt;&lt;SPAN&gt; to visually inspect what the model received. This helps diagnose whether the issue is input quality or model interpretation:&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;&lt;SPAN&gt;SELECT ai_parse_document(&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;content,&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;map(&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;'imageOutputPath', '/Volumes/catalog/schema/volume/debug_images/',&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;'descriptionElementTypes', '*'&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;)&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;) FROM ...&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;OL start="2"&gt;
&lt;LI style="font-weight: 400;" aria-level="1"&gt;&lt;STRONG&gt;Add a validation step&lt;/STRONG&gt;&lt;SPAN&gt; — since there are no confidence scores, implement your own (regex for expected ID formats, cross-reference against known values).&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI style="font-weight: 400;" aria-level="1"&gt;&lt;STRONG&gt;Consider a hybrid approach&lt;/STRONG&gt;&lt;SPAN&gt; — try PyMuPDF first; if it returns empty/garbled text (indicating a scanned PDF), fall back to &lt;/SPAN&gt;&lt;SPAN&gt;ai_parse_document&lt;/SPAN&gt;&lt;SPAN&gt;.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/OL&gt;
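&lt;P&gt;&lt;SPAN&gt;The validation step from the list above could look roughly like this. It is a minimal sketch under stated assumptions: the field names (invoice_id, customer_name), the ID format, and the reference list are all hypothetical and would need to be replaced with whatever your documents actually contain.&lt;/SPAN&gt;&lt;/P&gt;
&lt;PRE&gt;
```python
import re

# Hypothetical ID format for illustration: "INV-" followed by exactly five digits.
ID_PATTERN = re.compile(r"INV-\d{5}")


def validate_fields(fields, known_names):
    """Return a list of human-readable problems found in the extracted fields."""
    problems = []
    invoice_id = fields.get("invoice_id", "")
    # fullmatch rejects look-alike characters such as the letter O in place of zero.
    if not ID_PATTERN.fullmatch(invoice_id):
        problems.append(f"invoice_id {invoice_id!r} does not match the expected format")
    name = fields.get("customer_name", "")
    # Cross-reference against values you already trust (made-up here).
    if name not in known_names:
        problems.append(f"customer_name {name!r} is not in the reference list")
    return problems


known = {"Maria Oliveira", "John Smith"}
good = {"invoice_id": "INV-20418", "customer_name": "Maria Oliveira"}
bad = {"invoice_id": "INV-2O418", "customer_name": "Maria Olivera"}

print(validate_fields(good, known))
print(validate_fields(bad, known))
```
&lt;/PRE&gt;
&lt;P&gt;&lt;SPAN&gt;Records that come back with a non-empty problem list can be routed to manual review or re-extracted instead of flowing straight into downstream tables.&lt;/SPAN&gt;&lt;/P&gt;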
&lt;H2&gt;&lt;SPAN&gt;Example: Hybrid pipeline in Python&lt;/SPAN&gt;&lt;/H2&gt;
&lt;P&gt;&lt;SPAN&gt;import pymupdf  # PyMuPDF; older releases are imported as "fitz"&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;def extract_text(pdf_bytes):&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;"""Try PyMuPDF first (digital PDFs); return None to flag the doc for ai_parse_document."""&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;# Open from in-memory bytes; the context manager closes the document afterwards.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;with pymupdf.open(stream=pdf_bytes, filetype="pdf") as doc:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;text = "\n".join(page.get_text() for page in doc).strip()&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;# An empty text layer usually indicates a scanned, image-only PDF.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;return text or None&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;# In your pipeline:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;# 1. Run PyMuPDF on all docs&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;# 2. For docs where extract_text returns None → use ai_parse_document&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;# 3. Pass all extracted text to ai_query for structured extraction&lt;/SPAN&gt;&lt;/P&gt;
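&lt;P&gt;&lt;SPAN&gt;The helper above only catches an empty text layer. To also send garbled text to the fallback, a rough heuristic might look like the sketch below. The function names and both thresholds (min_chars, min_printable_ratio) are assumptions chosen for illustration, not values from the Databricks or PyMuPDF docs, so tune them on your own documents.&lt;/SPAN&gt;&lt;/P&gt;
&lt;PRE&gt;
```python
def looks_usable(text, min_chars=20, min_printable_ratio=0.9):
    """Heuristic check that extracted text is neither empty nor garbled."""
    if text is None:
        return False
    stripped = text.strip()
    if not stripped or min_chars > len(stripped):
        return False
    # U+FFFD marks characters the extractor could not decode.
    good = sum(
        1 for ch in stripped
        if ch != "\ufffd" and (ch.isprintable() or ch.isspace())
    )
    return good / len(stripped) >= min_printable_ratio


def choose_parser(text_layer):
    """Pick the extraction route for one document based on its text layer."""
    return "pymupdf" if looks_usable(text_layer) else "ai_parse_document"


print(choose_parser("Invoice INV-20418 issued to Maria Oliveira"))
print(choose_parser(None))
print(choose_parser("\ufffd" * 40))
```
&lt;/PRE&gt;
&lt;P&gt;&lt;SPAN&gt;A check like this is cheap compared with a model call, so it can run on every document before deciding which ones need ai_parse_document.&lt;/SPAN&gt;&lt;/P&gt;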
&lt;P&gt;&lt;BR /&gt;&lt;SPAN&gt;Your finding aligns with what others have observed — &lt;/SPAN&gt;&lt;SPAN&gt;ai_parse_document&lt;/SPAN&gt;&lt;SPAN&gt; is best suited for documents that &lt;/SPAN&gt;&lt;I&gt;&lt;SPAN&gt;require&lt;/SPAN&gt;&lt;/I&gt;&lt;SPAN&gt; AI-based parsing (scans, complex layouts), while digitally-born PDFs are better served by direct text extraction libraries.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 26 Mar 2026 16:57:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/generative-ai/testing-ai-parse-document-vs-pymupdf-for-pdf-extraction/m-p/152200#M1728</guid>
      <dc:creator>anuj_lathi</dc:creator>
      <dc:date>2026-03-26T16:57:32Z</dc:date>
    </item>
  </channel>
</rss>

