Issue with ai_parse_document Not Extracting Text from Images in PDF

rajcoder — Thu, 04 Dec 2025 06:10:55 GMT

Hello Team,

I hope you are doing well.

I am a student currently exploring Databricks and learning how to work with the ai_parse_document function. While experimenting, I ran into two issues with text extraction from images inside PDF files. The details and the code I used are below.

1. Text not extracted from all images in a PDF

I tested a PDF that contains two images, and each image has text inside it.
However, ai_parse_document extracts text from only one of the images.
The text from the second image is not extracted at all.

2. Images ignored in PDFs containing images + paragraphs

In another PDF containing both paragraph text and multiple images, the function extracts the paragraph text correctly, but no text is extracted from images.

Code Snippet Used

%sql
WITH parsed_documents AS (
  SELECT
    path,
    ai_parse_document(
      content,
      map(
        'imageOutputPath', '/Volumes/demo_raj_cat/demo_schema_cat/demo_volume_cat/demo_dir_cat/',
        'descriptionElementTypes', '*'
      )
    ) AS parsed
  FROM READ_FILES('/Volumes/demo_raj_cat/demo_schema_cat/demo_volume_cat/demo_dir_cat/pdf_with_two_images_part4.pdf', format => 'binaryFile')
),
parsed_text AS (
  SELECT
    path,
    concat_ws(
      '\n\n',
      transform(
        try_cast(parsed:document:elements AS ARRAY<STRING>),
        element -> try_cast(element:content AS STRING)
      )
    ) AS text
  FROM parsed_documents
  WHERE try_cast(parsed:error_status AS STRING) IS NULL
)
SELECT
  path,
  text,
  ai_query(
    'databricks-meta-llama-3-3-70b-instruct',
    concat(
      'Extract the following information from the document ',
      text
    ),
    returnType => 'STRING'
  ) AS structured_data
FROM parsed_text
WHERE text IS NOT NULL;

Attachments

I have also attached the PDF files used for testing.

I kindly request your guidance on why text inside images is not being fully extracted and whether there are additional configurations needed.

Thank you very much for your support.

Warm regards,
Raj

Re: Issue with ai_parse_document Not Extracting Text from Images in PDF

Hubert-Dudek — Thu, 04 Dec 2025 09:51:31 GMT

I explained how I process PDFs in this article: https://databrickster.medium.com/ai-parse-document-get-your-pdf-invoices-into-the-database-05565d3fa8a1

Re: Issue with ai_parse_document Not Extracting Text from Images in PDF

rajcoder — Thu, 04 Dec 2025 10:18:31 GMT

Thank you for your reply!

Yes, I have gone through your article — it explains very well how to extract text content from PDFs. However, I am facing a different issue.

In my case, the PDF contains multiple images and paragraphs, but ai_parse_document only extracts the paragraph text. The text inside the images is not extracted or parsed at all.

Just wanted to clarify that the issue is specifically related to handling images inside PDFs with text, not regular PDF text extraction.

Thank you again for your guidance!

Re: Issue with ai_parse_document Not Extracting Text from Images in PDF

szymon_dybczak — Thu, 04 Dec 2025 10:42:02 GMT

Hi @rajcoder ,

This can happen. In theory it should work, but keep in mind that this feature is still in preview and has the following limitations:

As you can see, they've mentioned that the function can sometimes ignore content (especially in documents that contain highly dense content or content with poor resolution).

Moreover, there's nothing you can do to improve that situation, because customizing the model that powers this function is not supported.
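
Even though the model cannot be tuned, it may still be worth checking what the function actually returned for each element before concluding that the images were skipped. The query in the original post joins only the content field, so anything the parser puts elsewhere would not show up in the final text. Below is a minimal Python sketch for that check, not a fix. It assumes you have exported the parsed result as a JSON string and that each entry under document.elements carries type, content, and description fields. The description field, the helper names, and the sample data are assumptions for illustration, so verify them against your own output.

```python
import json
from collections import Counter


def summarize_elements(parsed_json):
    """Return one summary row per element found under document.elements.

    Assumed (unverified) shape: {"document": {"elements": [{"id", "type",
    "content", "description"}, ...]}}. Missing fields are treated as empty.
    """
    parsed = json.loads(parsed_json)
    elements = (parsed.get("document") or {}).get("elements") or []
    rows = []
    for el in elements:
        content = el.get("content") or ""
        description = el.get("description") or ""
        rows.append({
            "id": el.get("id"),
            "type": el.get("type"),
            "has_content": bool(content.strip()),
            "has_description": bool(description.strip()),
        })
    return rows


def count_by_type(rows):
    """Count how many elements of each type were returned."""
    return Counter(row["type"] for row in rows)


# Hypothetical sample standing in for a real parsed result.
sample = json.dumps({
    "document": {
        "elements": [
            {"id": 0, "type": "text", "content": "Intro paragraph."},
            {"id": 1, "type": "figure", "content": "",
             "description": "Image containing a short label."},
            {"id": 2, "type": "figure", "content": None, "description": None},
        ]
    }
})

rows = summarize_elements(sample)
for row in rows:
    print(row)
print(count_by_type(rows))
```

If an image shows up as an element but with empty content, the parser detected it and the text may sit in another field. If the element is missing entirely, you are most likely hitting the limitation described above.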