ai_parse_document Not Extracting Text from Images in PDF

rajcoder — Thu, 04 Dec 2025 06:04:23 GMT

Hello Team,

I hope you are doing well.

I am a student currently exploring Databricks and learning how to work with the ai_parse_document function. While experimenting, I ran into two issues with text extraction from images inside PDF files. The details and the query I used are below.

1. Text not extracted from all images in a PDF

I tested a PDF that contains two images, and each image has text inside it.
However, ai_parse_document extracts text from only one of the images.
The text from the second image is not extracted at all.

2. Images ignored in PDFs containing images + paragraphs

In another PDF containing both paragraph text and multiple images, the function extracts the paragraph text correctly but returns no text from any of the images.

Code Snippet Used

%sql
WITH parsed_documents AS (
  SELECT
    path,
    ai_parse_document(
      content,
      map(
        'imageOutputPath', '/Volumes/demo_raj_cat/demo_schema_cat/demo_volume_cat/demo_dir_cat/',
        'descriptionElementTypes', '*'
      )
    ) AS parsed
  FROM READ_FILES(
    '/Volumes/demo_raj_cat/demo_schema_cat/demo_volume_cat/demo_dir_cat/pdf_with_two_images_part4.pdf',
    format => 'binaryFile'
  )
),
parsed_text AS (
  SELECT
    path,
    concat_ws(
      '\n\n',
      transform(
        try_cast(parsed:document:elements AS ARRAY<STRING>),
        element -> try_cast(element:content AS STRING)
      )
    ) AS text
  FROM parsed_documents
  WHERE try_cast(parsed:error_status AS STRING) IS NULL
)
SELECT
  path,
  text,
  ai_query(
    'databricks-meta-llama-3-3-70b-instruct',
    concat(
      'Extract the following information from the document ',
      text
    ),
    returnType => 'STRING'
  ) AS structured_data
FROM parsed_text
WHERE text IS NOT NULL;

Attachments

I have also attached the PDF files used for testing.

I kindly request your guidance on why the text inside images is not being fully extracted and whether any additional configuration is needed.

Thank you very much for your support.

Warm regards,
Raj

Re: ai_parse_document Not Extracting Text from Images in PDF

Advika — Thu, 04 Dec 2025 10:12:42 GMT

Hello @rajcoder!