Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.

ai_parse_document Not Extracting Text from Images in PDF

rajcoder
Visitor

Hello Team,

I hope you are doing well.

I am a student currently exploring Databricks and learning how to work with the ai_parse_document function. While experimenting, I ran into two issues with text extraction from images inside PDF files. The details and the code I used are below.

1. Text not extracted from all images in a PDF

I tested a PDF that contains two images, and each image has text inside it.
However, ai_parse_document extracts text from only one of the images.
The text from the second image is not extracted at all.


2. Images ignored in PDFs containing images + paragraphs

In another PDF containing both paragraph text and multiple images, the function extracts the paragraph text correctly, but no text is extracted from any of the images.

Code Snippet Used

%sql
WITH parsed_documents AS (
    SELECT
      path,
      ai_parse_document(
        content,
        map(
          'imageOutputPath', '/Volumes/demo_raj_cat/demo_schema_cat/demo_volume_cat/demo_dir_cat/',
          'descriptionElementTypes', '*'
        )
      ) AS parsed
    FROM READ_FILES('/Volumes/demo_raj_cat/demo_schema_cat/demo_volume_cat/demo_dir_cat/pdf_with_two_images_part4.pdf', format => 'binaryFile')
  ),
  parsed_text AS (
    SELECT
      path,
      concat_ws(
        '\n\n',
        transform(
          try_cast(parsed:document:elements AS ARRAY<STRING>),
          element -> try_cast(element:content AS STRING)
        )
      ) AS text
    FROM parsed_documents
    WHERE try_cast(parsed:error_status AS STRING) IS NULL
  )
  SELECT
    path,
    text,
    ai_query(
      'databricks-meta-llama-3-3-70b-instruct',
      concat(
        'Extract the following information from the document  ',
        text
      ),
      returnType => 'STRING'
    ) AS structured_data
  FROM parsed_text
  WHERE text IS NOT NULL;
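
One possibility worth ruling out is that the query above keeps only each element's content field, so anything the parser stores elsewhere for an image would never reach the text column. The sketch below is a minimal Python illustration of that effect. It runs on a hand-written sample, not on real output, and it assumes that each element carries type, content, and description fields and that figure text may land in description. Those field names and that behavior are assumptions to verify against the actual parsed value, and element_texts is a hypothetical helper.

```python
import json

# Hypothetical sample shaped like the ASSUMED ai_parse_document output.
# The field names (type, content, description) are assumptions, not confirmed output.
SAMPLE_PARSED = json.dumps({
    "document": {
        "elements": [
            {"id": 0, "type": "text", "content": "Intro paragraph.", "description": None},
            {"id": 1, "type": "figure", "content": None, "description": "Chart titled Q1 sales."},
        ]
    },
    "error_status": None,
})


def element_texts(parsed_json, include_descriptions):
    """Return one string per element, optionally falling back to `description`."""
    elements = json.loads(parsed_json).get("document", {}).get("elements") or []
    texts = []
    for element in elements:
        text = element.get("content")
        # Fall back to the description only when content is missing or empty.
        if not text and include_descriptions:
            text = element.get("description")
        if text:
            texts.append(text)
    return texts


# Mirrors the SQL above: only `content` is kept, joined with blank lines.
content_only = "\n\n".join(element_texts(SAMPLE_PARSED, include_descriptions=False))
# Variant that also keeps the description of elements without content.
with_descriptions = "\n\n".join(element_texts(SAMPLE_PARSED, include_descriptions=True))

print(content_only)
print(with_descriptions)
```

If the real output looks like this sample, selecting the description field alongside content in the SQL transform would be the equivalent change. If it does not, the missing text is more likely a parsing limitation than a query issue.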

Attachments

I have also attached the PDF files used for testing.

Could you please advise why the text inside the images is not being fully extracted, and whether any additional configuration is needed?

Thank you very much for your support.

Warm regards,
Raj

 

 

1 REPLY

Advika
Databricks Employee

Hello @rajcoder!

This post appears to duplicate the one you recently posted. A response has already been provided to your recent post. I recommend continuing the discussion in that thread to keep the conversation focused and organised.
