ai_parse_document struggling to detect pdf

JN_Bristol — Sun, 03 Aug 2025 11:43:21 GMT

Hi helpful experts 🌟

I'm writing my first PySpark Notebook that makes use of the new `ai_parse_document` function. I am basically following the code example from here: https://learn.microsoft.com/en-gb/azure/databricks/sql/language-manual/functions/ai_parse_document

(and doing it on Azure Databricks, if that helps)

My code:

    from pyspark.sql.functions import ai_parse_document

    volume_path = '/Volumes/gen_ai/bank_statements/raw_pdfs'

    raw_pdfs = (
        spark.read
        .format('binaryFile')
        .load(f'{volume_path}/*.pdf')
    )

    # this line works fine... I can see 'length' = 332159 and 'content' is binary
    raw_pdfs.display()

    # this line runs ok... but the output is in the 'corrupted_data' property
    parsed_pdfs = (
        raw_pdfs
        .withColumn(
            'content_parsed',
            ai_parse_document('content')
        )
    )

The error message is:

error_message: "[UNSTRUCTURED_DATA_PROCESSING_UNSUPPORTED_FILE_FORMAT] Unstructured file format detected: unknown is not supported. Supported file formats are auto, pdf, doc, docx, ppt, pptx, png, jpg, jpeg.\nPlease update the `format` from your ai function expression to one of the supported formats and then retry the query again. SQLSTATE: 0A000"

And yet the file _is_ a PDF. I downloaded it from my bank, and can open it fine in Acrobat and other tools. So I don't think it's the file that can be corrupted? 🤔
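One quick sanity check at this point is to look at the first bytes that Spark actually read, since a PDF normally begins with a `%PDF-` header. The sketch below is plain Python and does not touch `ai_parse_document`; the helper names `looks_like_pdf` and `describe_header` are made up for illustration, and the 1024-byte search window is an assumption based on some readers tolerating a few leading bytes before the header. It does not tell you how `ai_parse_document` itself detects formats.

```python
def looks_like_pdf(data, search_window=1024):
    """Return True if a '%PDF-' header appears near the start of the data.

    Accepts any bytes-like value (bytes, bytearray, memoryview).
    The search window size is an assumption, not a documented rule.
    """
    head = bytes(data[:search_window])
    return b"%PDF-" in head


def describe_header(data, length=16):
    """Return the first few bytes as plain bytes, for eyeballing."""
    return bytes(data[:length])


# Illustrative inputs (invented for this sketch, not real files)
sample_pdf = b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n1 0 obj\n"
sample_html = b"<!DOCTYPE html><html><body>Please log in</body></html>"

print(looks_like_pdf(sample_pdf))
print(looks_like_pdf(sample_html))
print(describe_header(sample_pdf))
```

In the notebook, the bytes of the first file could be fetched with something like `raw_pdfs.select('content').first()['content']` and passed to `looks_like_pdf`. If the header is present, the file itself is probably not the problem and the issue is more likely in how the bytes reach the function.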

Does anyone know what the error message means by "update the format from your ai function expression"? I can't see a parameter for that in the ai_parse_document documentation.

Alternatively, are there some PDFs that this (beta) function just can't handle yet?

Any advice much appreciated 🙏🏻

Re: ai_parse_document struggling to detect pdf

Vinay_M_R — Mon, 04 Aug 2025 04:53:52 GMT

Hello @JN_Bristol ,

There are some limitations when using the `ai_parse_document` function:

1.) While Databricks is continuously working to improve all of its features, LLMs are an emerging technology and may produce errors.

2.) The ai_parse_document function can take time to extract document content while preserving structural information, especially for documents that contain highly dense content or content with poor resolution. In some cases, the function may take a while to run or ignore content. Databricks is continuously working to improve latency.

I am sharing official documentation for your reference: https://learn.microsoft.com/en-gb/azure/databricks/sql/language-manual/functions/ai_parse_document#limitations

Suggestion:

Also, your input data files must be stored as blob data in bytes, meaning a binary type column in a DataFrame or Delta table. As your source documents are stored in a Unity Catalog volume, can you generate a binary type column using the Spark `binaryFile` format reader?

I am sharing official documentation for your reference: https://learn.microsoft.com/en-gb/azure/databricks/sql/language-manual/functions/ai_parse_document#-supported-input-file-formats

Re: ai_parse_document struggling to detect pdf

Sharanya13 — Mon, 04 Aug 2025 11:03:37 GMT

@JN_Bristol Can you describe the PDF document (size, contents) or share it? I have mixed experience with ai_parse

Re: ai_parse_document struggling to detect pdf

JN_Bristol — Mon, 04 Aug 2025 20:19:03 GMT

Hi @Sharanya13

It's an actual bank statement (not dummy data)... so, alas no, I cannot share it 😐 It's 6 pages, and contains a mixture of tables, graphics, and summary small print.

Are you suggesting I try "ai_parse" instead of "ai_parse_document"? ok, I'll give that a go 🙏🏻

Thanks 🙂

Re: ai_parse_document struggling to detect pdf

JN_Bristol — Mon, 04 Aug 2025 20:21:21 GMT

Hi @Vinay_M_R

Thanks for replying. The docs link is the same as the link that I included in my original post - and it is where I am following the code examples from. That example shows a pdf being read from a Volume - but are you saying I should not do this and should read directly from a Blob store instead? 🤔 I thought the Databricks position was that Volumes are the way forward?

Re: ai_parse_document struggling to detect pdf

lucaperes — Thu, 04 Dec 2025 14:00:45 GMT

Hello @JN_Bristol,

I discovered that ai_parse_document only works when the input is passed as real Python bytes.
The binaryFile format in Spark returns the content as an internal binary type (like a memoryview), and ai_parse_document can’t process that directly.
By using a UDF to convert the data into actual bytes, the function starts working correctly.

    from pyspark.sql.functions import ai_parse_document
    import pyspark.sql.functions as F
    from pyspark.sql.types import BinaryType
    import base64
    from io import BytesIO

    # UDF body: decode the incoming content and return it as Python bytes
    def conversor(content):
        pdf_bytes = base64.b64decode(content)
        pdf_file_like_object = BytesIO(pdf_bytes)
        return pdf_file_like_object.read()

    conversor_udf = F.udf(conversor, BinaryType())

    volume_path = '/Volumes/catalog/schema/volumn/'

    # read a single PDF from the volume as a binary column
    raw_pdfs = (
        spark.read
        .format('binaryFile')
        .load(f'{volume_path}/*.pdf')
    ).limit(1)

    display(raw_pdfs)

    # convert the content first, then parse the converted column
    parsed_pdfs = (
        raw_pdfs.withColumn(
            'content_bin', conversor_udf('content')
        )
        .withColumn(
            'content_parsed',
            ai_parse_document('content_bin')
        )
    )

    display(parsed_pdfs)
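The type conversion described in this reply can also be looked at on its own, outside Spark. The sketch below is plain Python and only shows how a bytes-like value (`bytearray` or `memoryview`) can be normalized into an immutable `bytes` object; `to_plain_bytes` is a hypothetical helper name. It deliberately leaves out the `base64.b64decode` call from the code above, and whether that decode step is needed on a given runtime is not something this sketch settles.

```python
def to_plain_bytes(content):
    """Normalize a bytes-like value to an immutable bytes object.

    None is passed through so that null rows stay null.
    """
    if content is None:
        return None
    if isinstance(content, bytes):
        return content
    # bytearray and memoryview both convert via the bytes constructor
    return bytes(content)


# Illustrative values (invented for this sketch)
payload = b"%PDF-1.7"
for value in (payload, bytearray(payload), memoryview(payload)):
    converted = to_plain_bytes(value)
    print(type(value).__name__, "->", type(converted).__name__,
          converted == payload)
```

If this were tried in the notebook, the helper could be wrapped the same way as in the reply, for example `F.udf(to_plain_bytes, BinaryType())`, and compared against the original UDF on a single file.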

Re: ai_parse_document struggling to detect pdf

JN_Bristol — Sun, 07 Dec 2025 14:50:17 GMT

Hi @luca wow!! Thanks for this - that's exactly the code snippet I needed 😊

Kudos very well earned 🙏🏻