
Is "ai_parse_document()" not a full OCR engine? It is not extracting text from a high-quality image

radha_krishna
Visitor

I used "ai_parse_document()" to parse a PNG file that contains cat images and text. I wanted to extract all the cat names from the image, but the response returned nothing. It seems that "ai_parse_document()" does not support rich image extraction. Am I right?

%sql
WITH parsed_documents AS (
    SELECT
      path,
      ai_parse_document(
        content,
        map(
          'imageOutputPath', '/Volumes/vector_search1/00_landing/volume1/',
          'descriptionElementTypes', '*'
        )
      ) AS parsed
    FROM READ_FILES('/Volumes/vector_search1/00_landing/volume1/cat names.png', format => 'binaryFile')
  ),
  parsed_text AS (
    SELECT
      path,
      concat_ws(
        '\n\n',
        transform(
          try_cast(parsed:document:elements AS ARRAY<STRING>),
          element -> try_cast(element:content AS STRING)
        )
      ) AS text
    FROM parsed_documents
    WHERE try_cast(parsed:error_status AS STRING) IS NULL
  )
  SELECT
    path,
    text,
    ai_query(
      'databricks-meta-llama-3-3-70b-instruct',
      concat(
        'Extract the following information from the document  ',
        text
      ),
      returnType => 'STRING'
    ) AS structured_data
  FROM parsed_text
  WHERE text IS NOT NULL;

 

3 REPLIES

bianca_unifeye
New Contributor III

Hi

https://docs.databricks.com/aws/en/sql/language-manual/functions/ai_parse_document

The following file formats are supported:

  • PDF
  • JPG / JPEG
  • PNG
  • DOC/DOCX
  • PPT/PPTX

Personally, I have tested it with PDF files of over 350 pages and it worked well. PNG is on the list, so the problem is probably something in the code. I will debug.

It doesn't have to be a bug in the code. This function still relies on AI models, and there's no guarantee that it will work correctly every time. The limitations section of the documentation mentions that ai_parse_document can even ignore some content.

[Attached image: szymon_dybczak_0-1764870864268.png]

 

bianca_unifeye
New Contributor III

I think that the function does not work for your “Cats Name” PNG because it relies on OCR.
Your image is a graphic with drawings and stylized text, so OCR finds no readable text, and the function returns nothing.

The code shared is fine.
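
One way to tell "the parser found no text" apart from "the parser reported an error" is to look at the raw output before flattening it. The query in the question filters on error_status IS NULL and text IS NOT NULL, so a failed or empty parse could show up as an empty result. The following is only a diagnostic sketch, not a confirmed fix: it reuses the path and the JSON fields from the question (document:elements, error_status), and the column aliases element_count and raw_elements are made up for illustration.

```sql
-- Diagnostic sketch: inspect the raw parser output for the single PNG,
-- without the WHERE filters that could hide an error or an empty parse.
WITH parsed_documents AS (
  SELECT
    path,
    ai_parse_document(content) AS parsed
  FROM READ_FILES('/Volumes/vector_search1/00_landing/volume1/cat names.png', format => 'binaryFile')
)
SELECT
  path,
  -- A non-NULL value here means the parser reported a problem for this file.
  try_cast(parsed:error_status AS STRING) AS error_status,
  -- Number of elements returned; same cast as in the original query.
  size(try_cast(parsed:document:elements AS ARRAY<STRING>)) AS element_count,
  -- The elements exactly as returned, for manual inspection.
  parsed:document:elements AS raw_elements
FROM parsed_documents;
```

If error_status is NULL and raw_elements contains no text content, that would be consistent with the explanation above that the parser found no readable text in the image. If error_status is populated, the message there is the better starting point.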