"ai_parse_document()" is not a full OCR engine ? It's not extracting text from high quality image

radha_krishna — Thu, 04 Dec 2025 11:02:16 GMT

I used "ai_parse_document()" to parse a PNG file that contains cat images and text. I wanted to extract all the cat names from the image, but the response returned nothing. It seems that "ai_parse_document()" does not support rich image extraction. Am I right?

%sql
WITH parsed_documents AS (
  SELECT
    path,
    ai_parse_document(
      content,
      map(
        'imageOutputPath', '/Volumes/vector_search1/00_landing/volume1/',
        'descriptionElementTypes', '*'
      )
    ) AS parsed
  FROM READ_FILES(
    '/Volumes/vector_search1/00_landing/volume1/cat names.png',
    format => 'binaryFile'
  )
),
parsed_text AS (
  SELECT
    path,
    concat_ws(
      '\n\n',
      transform(
        try_cast(parsed:document:elements AS ARRAY<STRING>),
        element -> try_cast(element:content AS STRING)
      )
    ) AS text
  FROM parsed_documents
  WHERE try_cast(parsed:error_status AS STRING) IS NULL
)
SELECT
  path,
  text,
  ai_query(
    'databricks-meta-llama-3-3-70b-instruct',
    concat(
      'Extract the following information from the document ',
      text
    ),
    returnType => 'STRING'
  ) AS structured_data
FROM parsed_text
WHERE text IS NOT NULL;

Re: "ai_parse_document()" is not a full OCR engine ? It's not extracting text from high qu

bianca_unifeye — Thu, 04 Dec 2025 17:39:52 GMT

https://docs.databricks.com/aws/en/sql/language-manual/functions/ai_parse_document

The following file formats are supported:

PDF
JPG / JPEG
PNG
DOC/DOCX
PPT/PPTX

Personally, I have tested it with PDF files of over 350 pages and it worked well. PNG is on the list, so it must be something in the code. I will debug.

Re: "ai_parse_document()" is not a full OCR engine ? It's not extracting text from high qu

szymon_dybczak — Thu, 04 Dec 2025 17:56:19 GMT

It doesn't have to be a bug in the code. This function still relies on AI models, and there's no guarantee that it will work correctly every time. The limitations section mentions that ai_parse_document can even ignore some content.
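
One way to tell a parse error apart from genuinely empty output is to look at the raw result before any filtering. The following is only a minimal sketch: it reuses the path and options from the original query and reads the same `error_status` and `document:elements` fields that query already touches. Treating a NULL `error_status` as "no error" is the original query's convention, assumed here rather than verified.

```sql
-- Inspect the raw parser output for the single PNG before flattening it.
-- The original query filters out rows whose error_status is set, so a
-- reported parse error would silently remove the row from the final result.
WITH parsed_documents AS (
  SELECT
    path,
    ai_parse_document(
      content,
      map(
        'imageOutputPath', '/Volumes/vector_search1/00_landing/volume1/',
        'descriptionElementTypes', '*'
      )
    ) AS parsed
  FROM READ_FILES(
    '/Volumes/vector_search1/00_landing/volume1/cat names.png',
    format => 'binaryFile'
  )
)
SELECT
  path,
  -- Same expression the original WHERE clause tests; no filtering here.
  try_cast(parsed:error_status AS STRING) AS error_status,
  -- Untouched VARIANT, so you can see exactly what the parser returned.
  parsed:document:elements AS raw_elements
FROM parsed_documents;
```

If `error_status` comes back populated, the empty result is a parse failure. If it is NULL and `raw_elements` is empty or has no text, the model did not pick up the content.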

Re: "ai_parse_document()" is not a full OCR engine ? It's not extracting text from high qu

bianca_unifeye — Thu, 04 Dec 2025 18:07:16 GMT

I think the function does not work for your “Cats Name” PNG because it relies on OCR.
Your image is a graphic with drawings and stylized text, so OCR finds no readable text, and the function returns nothing.

The code shared is fine.
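
One thing that may be worth checking: the query sets 'descriptionElementTypes' but only reads `content` from each element. If the parser classifies the drawings as figures, whatever it recovers might sit in a per-element description instead. The sketch below lists one row per element so that is visible. It assumes each element exposes `type` and `description` fields alongside `content`; those two field names are assumptions and should be checked against the output schema in the docs.

```sql
-- List every parsed element on its own row, with its type, content, and
-- (assumed) description, instead of concatenating only the content fields.
WITH parsed_documents AS (
  SELECT
    path,
    ai_parse_document(
      content,
      map(
        'imageOutputPath', '/Volumes/vector_search1/00_landing/volume1/',
        'descriptionElementTypes', '*'
      )
    ) AS parsed
  FROM READ_FILES(
    '/Volumes/vector_search1/00_landing/volume1/cat names.png',
    format => 'binaryFile'
  )
)
SELECT
  path,
  try_cast(element:type AS STRING)        AS element_type,  -- assumed field name
  try_cast(element:content AS STRING)     AS content,       -- field read by the original query
  try_cast(element:description AS STRING) AS description    -- assumed field name
FROM parsed_documents
-- Same cast as the original query, exploded to one row per element.
LATERAL VIEW explode(
  try_cast(parsed:document:elements AS ARRAY<STRING>)
) exploded AS element;
```

If the cat names show up under `description` rather than `content`, the text passed to ai_query would need to include that field as well.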

Re: "ai_parse_document()" is not a full OCR engine ? It's not extracting text from high qu

Raman_Unifeye — Thu, 04 Dec 2025 22:51:37 GMT

@szymon_dybczak - yes, since it relies on AI models, there is a chance of it missing a few cases because of their non-deterministic nature. I have used it in anger with a vast number of PDFs, and it has worked pretty well in all those cases. I have not tried it with PNGs.

@radha_krishna - As Bianca mentioned above, there does not seem to be an error in the code at first glance.