Generative AI
Explore discussions on generative artificial intelligence techniques and applications within the Databricks Community. Share ideas, challenges, and breakthroughs in this cutting-edge field.

ai_parse_document struggling to detect pdf

JN_Bristol
Contributor

Hi helpful experts 🌟

I'm writing my first PySpark Notebook that makes use of the new `ai_parse_document` function.  I am basically following the code example from here: https://learn.microsoft.com/en-gb/azure/databricks/sql/language-manual/functions/ai_parse_document

(and doing it on Azure Databricks, if that helps)

My code:

from pyspark.sql.functions import ai_parse_document

volume_path = '/Volumes/gen_ai/bank_statements/raw_pdfs'

raw_pdfs = (
    spark.read
    .format('binaryFile')
    .load(f'{volume_path}/*.pdf')
)

# this line works fine... I can see 'length' = 332159 and 'content' is binary
raw_pdfs.display()

# this line runs ok... but the output is in the 'corrupted_data' property
parsed_pdfs = (
    raw_pdfs
    .withColumn(
        'content_parsed',
        ai_parse_document('content')
    )
)

The error message is:

error_message: "[UNSTRUCTURED_DATA_PROCESSING_UNSUPPORTED_FILE_FORMAT] Unstructured file format detected: unknown is not supported. Supported file formats are auto, pdf, doc, docx, ppt, pptx, png, jpg, jpeg.\nPlease update the `format` from your ai function expression to one of the supported formats and then retry the query again. SQLSTATE: 0A000"

And yet the file _is_ a PDF. I downloaded it from my bank, and can open it fine in Acrobat and other tools. So I don't think the file itself is corrupted? 🤔

Does anyone know what the error message means by "update the format from your ai function expression"?  I can't see a parameter for that in the ai_parse_document documentation.

Alternatively, are there some PDFs that this (beta) function just can't handle yet?
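One sanity check that might narrow this down: since the error says the format was detected as "unknown", it's worth confirming that the bytes in the `content` column really begin with the PDF magic bytes. A minimal sketch (the helper name `looks_like_pdf` is my own, not a Databricks API):

```python
def looks_like_pdf(data: bytes, scan_window: int = 1024) -> bool:
    """Return True if a %PDF header appears in the first `scan_window` bytes.

    The PDF spec puts the header at offset 0, but some generators prepend
    junk bytes; many readers still accept a header within the first 1 KB.
    """
    return b"%PDF" in data[:scan_window]

# On Databricks you could pull one row's bytes and inspect them, e.g.:
# sample = raw_pdfs.select("content").first()["content"]
# print(sample[:8], looks_like_pdf(sample))
```

If this returns False for the file, the problem is the payload rather than the function.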

Any advice much appreciated 🙏🏻

4 REPLIES

Vinay_M_R
Databricks Employee

Hello @JN_Bristol ,

There are some limitations when using the `ai_parse_document` function:

1.) While Databricks is continuously working to improve all of its features, LLMs are an emerging technology and may produce errors.

2.) The ai_parse_document function can take time to extract document content while preserving structural information, especially for documents that contain highly dense content or content with poor resolution. In some cases, the function may take a while to run or ignore content. Databricks is continuously working to improve latency.

I am sharing official documentation for your reference: https://learn.microsoft.com/en-gb/azure/databricks/sql/language-manual/functions/ai_parse_document#l...

Suggestion:

Also, your input data files must be stored as blob data in bytes, meaning a binary type column in a DataFrame or Delta table. As your source documents are stored in a Unity Catalog volume, can you generate a binary type column using the Spark binaryFile format reader?

I am sharing official documentation for your reference: https://learn.microsoft.com/en-gb/azure/databricks/sql/language-manual/functions/ai_parse_document#-...
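Adding a thought in case the format detector is tripping on the file's leading bytes: some PDF generators prepend junk before the `%PDF` marker, and a strict detector may then report "unknown" even though Acrobat opens the file fine. A small, hypothetical helper (`strip_to_pdf_header` is my own name, not a Databricks API) that trims anything before the header:

```python
def strip_to_pdf_header(data: bytes) -> bytes:
    """Drop any bytes that precede the %PDF header, if one exists.

    Raises ValueError when no header is found, so a genuinely
    non-PDF payload still fails loudly instead of passing through.
    """
    idx = data.find(b"%PDF")
    if idx < 0:
        raise ValueError("no %PDF header found in payload")
    return data[idx:]
```

This could be applied to the `content` column (for example via a UDF) before calling `ai_parse_document`; whether it helps depends on whether leading bytes are actually the cause, which is an assumption on my part.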


Hi @Vinay_M_R 

Thanks for replying. The docs link is the same one I included in my original post, and it is where I'm following the code examples from. That example shows a PDF being read from a Volume, but are you saying I should not do this and should read directly from Blob storage instead? 🤔 I thought the Databricks position was that Volumes are the way forward?

Sharanya13
Contributor III

@JN_Bristol Can you describe the PDF document (size, contents) or share it? I've had mixed experiences with ai_parse.

Hi @Sharanya13 

It's an actual bank statement (not dummy data)... so, alas no, I cannot share it 😞 It's 6 pages, and contains a mixture of tables, graphics, and summary small print.

Are you suggesting I try "ai_parse" instead of "ai_parse_document"? OK, I'll give that a go 🙏🏻

Thanks 🙂
