Generative AI
Explore discussions on generative artificial intelligence techniques and applications within the Databricks Community. Share ideas, challenges, and breakthroughs in this cutting-edge field.

Issue with ai_parse_document Not Extracting Text from Images in PDF

rajcoder
Visitor

Hello Team,

I hope you are doing well.

I am a student currently exploring Databricks and learning how to work with the "ai_parse_document" function. While experimenting, I ran into two issues with text extraction from images inside PDF files. I am sharing the details below, along with the code snippet I used.

1. Text not extracted from all images in a PDF

I tested a PDF that contains two images, and each image has text inside it.
However, "ai_parse_document" extracts text from only one of the images.
The text from the second image is not extracted at all.


2. Images ignored in PDFs containing images + paragraphs

In another PDF containing both paragraph text and multiple images, the function extracts the paragraph text correctly, but extracts no text from any of the images.

Code Snippet Used

%sql
WITH parsed_documents AS (
    SELECT
      path,
      ai_parse_document(
        content,
        map(
          'imageOutputPath', '/Volumes/demo_raj_cat/demo_schema_cat/demo_volume_cat/demo_dir_cat/',
          'descriptionElementTypes', '*'
        )
      ) AS parsed
    FROM READ_FILES('/Volumes/demo_raj_cat/demo_schema_cat/demo_volume_cat/demo_dir_cat/pdf_with_two_images_part4.pdf', format => 'binaryFile')
  ),
  parsed_text AS (
    SELECT
      path,
      concat_ws(
        '\n\n',
        transform(
          try_cast(parsed:document:elements AS ARRAY<STRING>),
          element -> try_cast(element:content AS STRING)
        )
      ) AS text
    FROM parsed_documents
    WHERE try_cast(parsed:error_status AS STRING) IS NULL
  )
  SELECT
    path,
    text,
    ai_query(
      'databricks-meta-llama-3-3-70b-instruct',
      concat(
        'Extract the following information from the document  ',
        text
      ),
      returnType => 'STRING'
    ) AS structured_data
  FROM parsed_text
  WHERE text IS NOT NULL;

Attachments

I have also attached the PDF files used for testing.

I would appreciate your guidance on why the text inside images is not being fully extracted, and on whether any additional configuration is needed.

Thank you very much for your support.

Warm regards,
Raj

3 REPLIES

Hubert-Dudek
Esteemed Contributor III

rajcoder
Visitor

Thank you for your reply!

Yes, I have gone through your article — it explains very well how to extract text content from PDFs. However, I am facing a different issue.

In my case, the PDF contains multiple images and paragraphs, but "ai_parse_document" is only able to extract the paragraph text. The images in the PDF (which also contain text inside them) are not being extracted or parsed at all.

Just wanted to clarify that the issue is specifically related to handling images inside PDFs with text, not regular PDF text extraction.

Thank you again for your guidance!

szymon_dybczak
Esteemed Contributor III

Hi @rajcoder ,

It can happen. In theory it should work, but keep in mind that this feature is still in preview and has the following limitations:

[Screenshot: documented limitations of ai_parse_document]

As you can see, they've mentioned that the function can sometimes ignore content (especially in documents that contain highly dense content or content with poor resolution).

Moreover, there's nothing you can do to improve the situation, because customizing the model that powers this function is not supported.
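
Before concluding that an image was skipped entirely, it may be worth checking which elements the function actually returned. The query in the question reads only the "content" field of each element, so any text stored in a different field would not appear in the final output. Below is a minimal Python sketch of that check. It assumes the parsed result is JSON with a "document.elements" array whose entries have "type", "content" and "description" fields. Those field names and the sample data are assumptions for illustration only, so please compare them with the raw output of your own run.

```python
import json

# Hypothetical sample shaped like a parsed result. The field names
# (type, content, description) are assumptions, not confirmed output.
sample_parsed = json.loads("""
{
  "document": {
    "elements": [
      {"id": 0, "type": "text", "content": "Intro paragraph.", "description": null},
      {"id": 1, "type": "figure", "content": null, "description": "Chart with the label Q1 revenue."},
      {"id": 2, "type": "figure", "content": null, "description": null}
    ]
  },
  "error_status": null
}
""")


def summarize_elements(parsed):
    """Return one row per element showing which text-bearing fields are populated."""
    rows = []
    for element in parsed.get("document", {}).get("elements", []):
        rows.append({
            "id": element.get("id"),
            "type": element.get("type"),
            "has_content": bool(element.get("content")),
            "has_description": bool(element.get("description")),
        })
    return rows


def collect_text(parsed):
    """Join element text, falling back to description when content is empty."""
    parts = []
    for element in parsed.get("document", {}).get("elements", []):
        text = element.get("content") or element.get("description")
        if text:
            parts.append(text)
    return "\n\n".join(parts)


summary = summarize_elements(sample_parsed)
for row in summary:
    print(row)

combined_text = collect_text(sample_parsed)
print(combined_text)
```

If an image shows up as an element with both fields empty (like element 2 in the sample), or does not show up at all, that would be consistent with the limitation described above. If the text sits in another field, the query could be adjusted to read that field as well.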
