08-03-2025 04:43 AM
Hi helpful experts 🌟
I'm writing my first PySpark Notebook that makes use of the new `ai_parse_document` function. I am basically following the code example from here: https://learn.microsoft.com/en-gb/azure/databricks/sql/language-manual/functions/ai_parse_document
(and doing it on Azure Databricks, if that helps)
My code:
from pyspark.sql.functions import ai_parse_document
volume_path = '/Volumes/gen_ai/bank_statements/raw_pdfs'
raw_pdfs = (
    spark.read
        .format('binaryFile')
        .load(f'{volume_path}/*.pdf')
)
# this line works fine... I can see 'length' = 332159 and 'content' is binary
raw_pdfs.display()
# this line runs ok... but the output is in the 'corrupted_data' property
parsed_pdfs = (
    raw_pdfs
        .withColumn(
            'content_parsed',
            ai_parse_document('content')
        )
)

The error message is:
error_message: "[UNSTRUCTURED_DATA_PROCESSING_UNSUPPORTED_FILE_FORMAT] Unstructured file format detected: unknown is not supported. Supported file formats are auto, pdf, doc, docx, ppt, pptx, png, jpg, jpeg.\nPlease update the `format` from your ai function expression to one of the supported formats and then retry the query again. SQLSTATE: 0A000"And yet the file _is_ a PDF. I downloaded it from my bank, and can open it fine in Acrobat and other tools. So I don't think it's the file that can be corrupted? 🤔
Does anyone know what the error message means by "update the format from your ai function expression"? I can't see a parameter for that in the ai_parse_document documentation.
Alternatively, are there some PDFs that this (beta) function just can't handle yet?
Any advice much appreciated 🙏🏻
08-03-2025 09:53 PM
Hello @JN_Bristol ,
There are some limitations when using the `ai_parse_document` function:
1.) While Databricks is continuously working to improve all of its features, LLMs are an emerging technology and may produce errors.
2.) The ai_parse_document function can take time to extract document content while preserving structural information, especially for documents that contain highly dense content or content with poor resolution. In some cases, the function may take a while to run or may skip some content. Databricks is continuously working to improve latency. (A sketch after this list shows one way to keep the first test cheap.)
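As an illustration only (the volume path below is a placeholder, and the column names come from the binaryFile reader), you can keep the first test cheap by parsing a single document before pointing the function at the whole volume:

from pyspark.sql.functions import ai_parse_document

# placeholder volume path -- substitute your own catalog/schema/volume
volume_path = '/Volumes/<catalog>/<schema>/<volume>'

# read just one PDF so a slow or failing parse is cheap to diagnose
sample_pdf = (
    spark.read
        .format('binaryFile')
        .load(f'{volume_path}/*.pdf')
        .limit(1)
)

# parse the single document and inspect the output before running the full batch
parsed_sample = sample_pdf.withColumn('content_parsed', ai_parse_document('content'))
parsed_sample.display()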
I am sharing official documentation for your reference: https://learn.microsoft.com/en-gb/azure/databricks/sql/language-manual/functions/ai_parse_document#l...
Suggestion:
Also, your input data files must be stored as blob data in bytes, meaning a binary type column in a DataFrame or Delta table. As your source documents are stored in a Unity Catalog volume, can you generate a binary type column using Spark's binaryFile format reader?
I am sharing official documentation for your reference: https://learn.microsoft.com/en-gb/azure/databricks/sql/language-manual/functions/ai_parse_document#-...
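For example, a quick sanity check (the path is a placeholder) to confirm that the reader really produced a binary `content` column before calling ai_parse_document:

# read the PDFs from the Unity Catalog volume as binary content
raw_pdfs = (
    spark.read
        .format('binaryFile')
        .load('/Volumes/<catalog>/<schema>/<volume>/*.pdf')
)

# the binaryFile reader should give: path, modificationTime, length, and a binary 'content' column
raw_pdfs.printSchema()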
08-04-2025 01:21 PM
Hi @Vinay_M_R
Thanks for replying. The docs link is the same as the link that I included in my original post - and it is where I am following the code examples from. That example shows a pdf being read from a Volume - but are you saying I should not do this and should read directly from a Blob store instead? 🤔 I thought the Databricks position was that Volumes are the way forward?
08-04-2025 04:03 AM
@JN_Bristol Can you describe the PDF document (size, contents) or share it? I've had mixed experience with ai_parse.
08-04-2025 01:19 PM
Hi @Sharanya13
It's an actual bank statement (not dummy data)... so, alas no, I cannot share it 😐 It's 6 pages, and contains a mixture of tables, graphics, and summary small print.
Are you suggesting I try "ai_parse" instead of "ai_parse_document"? ok, I'll give that a go 🙏🏻
Thanks 🙂
Thursday
Hello @JN_Bristol,
I discovered that ai_parse_document only works when the input is passed as real Python bytes.
The binaryFile format in Spark returns the content as an internal binary type (like a memoryview), and ai_parse_document can’t process that directly.
By using a UDF to convert the data into actual bytes, the function starts working correctly.
from pyspark.sql.functions import ai_parse_document
import pyspark.sql.functions as F
from pyspark.sql.types import BinaryType
import base64
from io import BytesIO

# UDF that round-trips the column value into a plain Python bytes object
def conversor(content):
    # decode the base64 payload and read it back out as raw bytes
    pdf_bytes = base64.b64decode(content)
    pdf_file_like_object = BytesIO(pdf_bytes)
    return pdf_file_like_object.read()

conversor_udf = F.udf(conversor, BinaryType())

volume_path = '/Volumes/catalog/schema/volumn/'

# read a single PDF from the volume as binary content
raw_pdfs = (
    spark.read
        .format('binaryFile')
        .load(f'{volume_path}/*.pdf')
).limit(1)

display(raw_pdfs)

# convert the content column to plain bytes, then parse the document
parsed_pdfs = (
    raw_pdfs.withColumn(
        'content_bin', conversor_udf('content')
    )
    .withColumn(
        'content_parsed',
        ai_parse_document('content_bin')
    )
)

display(parsed_pdfs)
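As a follow-up sketch (the target table name is just an example, and the cast to string is only there to keep it independent of the exact output schema), you could persist the parsed results once it works:

# keep the file path plus a string rendering of the parsed output; drop the raw bytes
results = (
    parsed_pdfs
        .withColumn('content_parsed_str', F.col('content_parsed').cast('string'))
        .select('path', 'content_parsed_str')
)

# example target table -- adjust catalog/schema/table to your environment
results.write.mode('overwrite').saveAsTable('gen_ai.bank_statements.parsed_pdfs')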
yesterday
Hi @luca wow!! Thanks for this - that's exactly the code snippet I needed 😊
Kudos very well earned 🙏🏻