Databricks Community

JN_Bristol · ‎08-03-2025

Hi helpful experts 🌟

I'm writing my first PySpark Notebook that makes use of the new `ai_parse_document` function. I am basically following the code example from here: https://learn.microsoft.com/en-gb/azure/databricks/sql/language-manual/functions/ai_parse_document

(and doing it on Azure Databricks, if that helps)

My code:

from pyspark.sql.functions import ai_parse_document

volume_path = '/Volumes/gen_ai/bank_statements/raw_pdfs'

raw_pdfs = (
    spark.read
    .format('binaryFile')
    .load(f'{volume_path}/*.pdf')
)

# this line works fine... I can see 'length' = 332159 and 'content' is binary
raw_pdfs.display()

# this line runs ok... but the output is in the 'corrupted_data' property
parsed_pdfs = (
    raw_pdfs
    .withColumn(
        'content_parsed',
        ai_parse_document('content')
    )
)

The error message is:

error_message: "[UNSTRUCTURED_DATA_PROCESSING_UNSUPPORTED_FILE_FORMAT] Unstructured file format detected: unknown is not supported. Supported file formats are auto, pdf, doc, docx, ppt, pptx, png, jpg, jpeg.\nPlease update the `format` from your ai function expression to one of the supported formats and then retry the query again. SQLSTATE: 0A000"

And yet the file _is_ a PDF. I downloaded it from my bank, and can open it fine in Acrobat and other tools. So I don't think the file itself is corrupted? 🤔

Does anyone know what the error message means by "update the format from your ai function expression"? I can't see a parameter for that in the ai_parse_document documentation.

Alternatively, are there some PDFs that this (beta) function just can't handle yet?

Any advice much appreciated 🙏🏻

Vinay_M_R · ‎08-03-2025

Hello @JN_Bristol ,

There are some limitations when using the `ai_parse_document` function:

1.) While Databricks is continuously working to improve all of its features, LLMs are an emerging technology and may produce errors.

2.) The ai_parse_document function can take time to extract document content while preserving structural information, especially for documents that contain highly dense content or content with poor resolution. In some cases, the function may take a while to run or ignore content. Databricks is continuously working to improve latency.

I am sharing official documentation for your reference: https://learn.microsoft.com/en-gb/azure/databricks/sql/language-manual/functions/ai_parse_document#l...

Suggestion:

Also, your input data files must be stored as blob data in bytes, meaning a binary-type column in a DataFrame or Delta table. As your source documents are stored in a Unity Catalog volume, can you generate the binary-type column using the Spark `binaryFile` format reader?

I am sharing official documentation for your reference: https://learn.microsoft.com/en-gb/azure/databricks/sql/language-manual/functions/ai_parse_document#-...
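
One way to check the binary-column requirement above is to look at the first bytes of the `content` column and see whether they carry the usual `%PDF-` signature. The sketch below is only a diagnostic: the helper name `describe_pdf_header` and the 1024-byte search window are assumptions made for illustration, and whether `ai_parse_document` relies on this signature for format detection is not confirmed by the documentation.

```python
# Hypothetical diagnostic helper (not part of any Databricks API).
PDF_MAGIC = b"%PDF-"


def describe_pdf_header(content, search_window=1024):
    """Report where, if anywhere, the PDF signature appears in the leading bytes.

    `content` is the raw file content, e.g. the value of the `content`
    column produced by the `binaryFile` reader (bytes or bytearray).
    """
    head = bytes(content[:search_window])
    offset = head.find(PDF_MAGIC)  # -1 when the signature is not in the window
    return {
        "length": len(content),
        "starts_with_pdf_magic": offset == 0,
        "magic_offset": offset,
        "first_bytes": head[:16],
    }
```

In a notebook this could be called on a single row, for example `describe_pdf_header(raw_pdfs.select('content').first()['content'])`. A `magic_offset` greater than 0 would mean the file has extra bytes before the PDF header, which lenient viewers such as Acrobat tolerate but a stricter detector might not; a value of -1 would suggest the bytes are not what the bank's download page implied.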

JN_Bristol · ‎08-04-2025

Hi @Vinay_M_R

Thanks for replying. The docs link is the same as the link that I included in my original post - and it is where I am following the code examples from. That example shows a pdf being read from a Volume - but are you saying I should not do this and should read directly from a Blob store instead? 🤔 I thought the Databricks position was that Volumes are the way forward?

Sharanya13 · ‎08-04-2025

@JN_Bristol Can you describe the PDF document (size, contents) or share it? I have mixed experience with ai_parse

JN_Bristol · ‎08-04-2025

Hi @Sharanya13

It's an actual bank statement (not dummy data)... so, alas no, I cannot share it 😐 It's 6 pages, and contains a mixture of tables, graphics, and summary small print.

Are you suggesting I try "ai_parse" instead of "ai_parse_document"? ok, I'll give that a go 🙏🏻

Thanks 🙂

ai_parse_document struggling to detect pdf
