Hello Team,
I hope you are doing well.
I am a student currently exploring Databricks and learning how to work with the ai_parse_document function. While experimenting, I encountered two issues related to text extraction from images inside PDF files. I have included the details below, along with the code snippet I used.
1. Text not extracted from all images in a PDF
I tested a PDF that contains two images, and each image has text inside it.
However, ai_parse_document extracts text from only one of the images.
The text from the second image is not extracted at all.
2. Images ignored in PDFs containing images + paragraphs
In another PDF containing both paragraph text and multiple images, the function extracts the paragraph text correctly but extracts no text from any of the images.
Code Snippet Used
%sql
WITH parsed_documents AS (
  SELECT
    path,
    ai_parse_document(
      content,
      map(
        'imageOutputPath', '/Volumes/demo_raj_cat/demo_schema_cat/demo_volume_cat/demo_dir_cat/',
        'descriptionElementTypes', '*'
      )
    ) AS parsed
  FROM READ_FILES('/Volumes/demo_raj_cat/demo_schema_cat/demo_volume_cat/demo_dir_cat/pdf_with_two_images_part4.pdf', format => 'binaryFile')
),
parsed_text AS (
  SELECT
    path,
    concat_ws(
      '\n\n',
      transform(
        try_cast(parsed:document:elements AS ARRAY<STRING>),
        element -> try_cast(element:content AS STRING)
      )
    ) AS text
  FROM parsed_documents
  WHERE try_cast(parsed:error_status AS STRING) IS NULL
)
SELECT
  path,
  text,
  ai_query(
    'databricks-meta-llama-3-3-70b-instruct',
    concat(
      'Extract the following information from the document ',
      text
    ),
    returnType => 'STRING'
  ) AS structured_data
FROM parsed_text
WHERE text IS NOT NULL;
Attachments
I have also attached the PDF files used for testing.
I would appreciate your guidance on why the text inside images is not being fully extracted and whether any additional configuration is needed.
Thank you very much for your support.
Warm regards,
Raj