<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: &amp;quot;ai_parse_document()&amp;quot; is not a full OCR engine ? It's not extracting text from high qu in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/quot-ai-parse-document-quot-is-not-a-full-ocr-engine-it-s-not/m-p/141186#M51653</link>
    <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&lt;A href="https://docs.databricks.com/aws/en/sql/language-manual/functions/ai_parse_document" target="_blank"&gt;https://docs.databricks.com/aws/en/sql/language-manual/functions/ai_parse_document&lt;/A&gt;&lt;/P&gt;&lt;P&gt;The following file formats are supported:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;PDF&lt;/LI&gt;&lt;LI&gt;JPG / JPEG&lt;/LI&gt;&lt;LI&gt;PNG&lt;/LI&gt;&lt;LI&gt;DOC/DOCX&lt;/LI&gt;&lt;LI&gt;PPT/PPTX&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Personally, I have tested it with PDF files of over 350 pages and it worked well. PNG is on the supported list, so the issue is something within the code. I will debug.&lt;/P&gt;</description>
    <pubDate>Thu, 04 Dec 2025 17:39:52 GMT</pubDate>
    <dc:creator>bianca_unifeye</dc:creator>
    <dc:date>2025-12-04T17:39:52Z</dc:date>
    <item>
      <title>"ai_parse_document()" is not a full OCR engine ? It's not extracting text from high quality image</title>
      <link>https://community.databricks.com/t5/data-engineering/quot-ai-parse-document-quot-is-not-a-full-ocr-engine-it-s-not/m-p/141139#M51629</link>
      <description>&lt;P&gt;I used "ai_parse_document()" to parse a PNG file that contains cat images and text. I wanted to extract all the cat names from the image, but the response returned nothing. It seems that "ai_parse_document()" does not support rich image extraction. Am I right?&lt;/P&gt;&lt;LI-CODE lang="python"&gt;%sql
WITH parsed_documents AS (
    SELECT
      path,
      ai_parse_document(
        content,
        map(
          'imageOutputPath', '/Volumes/vector_search1/00_landing/volume1/',
          'descriptionElementTypes', '*'
        )
      ) AS parsed
    FROM READ_FILES('/Volumes/vector_search1/00_landing/volume1/cat names.png', format =&amp;gt; 'binaryFile')
  ),
  parsed_text AS (
    SELECT
      path,
      concat_ws(
        '\n\n',
        transform(
          try_cast(parsed:document:elements AS ARRAY&amp;lt;STRING&amp;gt;),
          element -&amp;gt; try_cast(element:content AS STRING)
        )
      ) AS text
    FROM parsed_documents
    WHERE try_cast(parsed:error_status AS STRING) IS NULL
  )
  SELECT
    path,
    text,
    ai_query(
      'databricks-meta-llama-3-3-70b-instruct',
      concat(
        'Extract the following information from the document  ',
        text
      ),
      returnType =&amp;gt; 'STRING'
    ) AS structured_data
  FROM parsed_text
  WHERE text IS NOT NULL;&lt;/LI-CODE&gt;</description>
      <pubDate>Thu, 04 Dec 2025 11:02:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/quot-ai-parse-document-quot-is-not-a-full-ocr-engine-it-s-not/m-p/141139#M51629</guid>
      <dc:creator>radha_krishna</dc:creator>
      <dc:date>2025-12-04T11:02:16Z</dc:date>
    </item>
    <item>
      <title>Re: "ai_parse_document()" is not a full OCR engine ? It's not extracting text from high qu</title>
      <link>https://community.databricks.com/t5/data-engineering/quot-ai-parse-document-quot-is-not-a-full-ocr-engine-it-s-not/m-p/141186#M51653</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&lt;A href="https://docs.databricks.com/aws/en/sql/language-manual/functions/ai_parse_document" target="_blank"&gt;https://docs.databricks.com/aws/en/sql/language-manual/functions/ai_parse_document&lt;/A&gt;&lt;/P&gt;&lt;P&gt;The following file formats are supported:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;PDF&lt;/LI&gt;&lt;LI&gt;JPG / JPEG&lt;/LI&gt;&lt;LI&gt;PNG&lt;/LI&gt;&lt;LI&gt;DOC/DOCX&lt;/LI&gt;&lt;LI&gt;PPT/PPTX&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Personally, I have tested it with PDF files of over 350 pages and it worked well. PNG is on the supported list, so the issue is something within the code. I will debug.&lt;/P&gt;</description>
      <pubDate>Thu, 04 Dec 2025 17:39:52 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/quot-ai-parse-document-quot-is-not-a-full-ocr-engine-it-s-not/m-p/141186#M51653</guid>
      <dc:creator>bianca_unifeye</dc:creator>
      <dc:date>2025-12-04T17:39:52Z</dc:date>
    </item>
    <item>
      <title>Re: "ai_parse_document()" is not a full OCR engine ? It's not extracting text from high qu</title>
      <link>https://community.databricks.com/t5/data-engineering/quot-ai-parse-document-quot-is-not-a-full-ocr-engine-it-s-not/m-p/141189#M51654</link>
      <description>&lt;P&gt;It doesn't have to be a bug in the code. This function still relies on AI models, and there's no guarantee that it will work correctly every time. The limitations section mentions that&amp;nbsp;ai_parse_document can even ignore some content.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="szymon_dybczak_0-1764870864268.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/22122iAFDE674819542068/image-size/medium?v=v2&amp;amp;px=400" role="button" title="szymon_dybczak_0-1764870864268.png" alt="szymon_dybczak_0-1764870864268.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 04 Dec 2025 17:56:19 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/quot-ai-parse-document-quot-is-not-a-full-ocr-engine-it-s-not/m-p/141189#M51654</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2025-12-04T17:56:19Z</dc:date>
    </item>
    <item>
      <title>Re: "ai_parse_document()" is not a full OCR engine ? It's not extracting text from high qu</title>
      <link>https://community.databricks.com/t5/data-engineering/quot-ai-parse-document-quot-is-not-a-full-ocr-engine-it-s-not/m-p/141190#M51655</link>
      <description>&lt;P&gt;I think the function does &lt;STRONG&gt;not&lt;/STRONG&gt; work for your “Cats Name” PNG because it relies on &lt;STRONG&gt;OCR&lt;/STRONG&gt;.&lt;BR /&gt;Your image is a &lt;STRONG&gt;graphic with drawings and stylized text&lt;/STRONG&gt;, so OCR finds &lt;STRONG&gt;no readable text&lt;/STRONG&gt; and the function returns nothing.&lt;/P&gt;&lt;P&gt;The code you shared is fine.&lt;/P&gt;</description>
      <pubDate>Thu, 04 Dec 2025 18:07:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/quot-ai-parse-document-quot-is-not-a-full-ocr-engine-it-s-not/m-p/141190#M51655</guid>
      <dc:creator>bianca_unifeye</dc:creator>
      <dc:date>2025-12-04T18:07:16Z</dc:date>
    </item>
    <item>
      <title>Re: "ai_parse_document()" is not a full OCR engine ? It's not extracting text from high qu</title>
      <link>https://community.databricks.com/t5/data-engineering/quot-ai-parse-document-quot-is-not-a-full-ocr-engine-it-s-not/m-p/141202#M51660</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/110502"&gt;@szymon_dybczak&lt;/a&gt;&amp;nbsp;- yes, since it relies on AI models, it can miss a few cases because of their non-deterministic nature. I have used it in anger on a vast number of PDFs, and it worked pretty well in all those cases. I have not tried it with PNGs.&lt;/P&gt;&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/199762"&gt;@radha_krishna&lt;/a&gt;&amp;nbsp;- As Bianca mentioned above, there does not seem to be an error in the code at first glance.&lt;/P&gt;</description>
      <pubDate>Thu, 04 Dec 2025 22:51:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/quot-ai-parse-document-quot-is-not-a-full-ocr-engine-it-s-not/m-p/141202#M51660</guid>
      <dc:creator>Raman_Unifeye</dc:creator>
      <dc:date>2025-12-04T22:51:37Z</dc:date>
    </item>
  </channel>
</rss>

