Hi helpful experts
I'm writing my first PySpark Notebook that makes use of the new `ai_parse_document` function. I am basically following the code example from here: https://learn.microsoft.com/en-gb/azure/databricks/sql/language-manual/functions/ai_parse_document
(and doing it on Azure Databricks, if that helps)
My code:
from pyspark.sql.functions import ai_parse_document

volume_path = '/Volumes/gen_ai/bank_statements/raw_pdfs'

raw_pdfs = (
    spark.read
    .format('binaryFile')
    .load(f'{volume_path}/*.pdf')
)

# this line works fine... I can see 'length' = 332159 and 'content' is binary
raw_pdfs.display()

# this line runs ok... but the output is in the 'corrupted_data' property
parsed_pdfs = (
    raw_pdfs
    .withColumn(
        'content_parsed',
        ai_parse_document('content')
    )
)
The error message is:
error_message: "[UNSTRUCTURED_DATA_PROCESSING_UNSUPPORTED_FILE_FORMAT] Unstructured file format detected: unknown is not supported. Supported file formats are auto, pdf, doc, docx, ppt, pptx, png, jpg, jpeg.\nPlease update the `format` from your ai function expression to one of the supported formats and then retry the query again. SQLSTATE: 0A000"
And yet the file _is_ a PDF. I downloaded it from my bank and can open it fine in Acrobat and other tools, so I don't think the file itself is corrupted.
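One sanity check that might help narrow this down is to look at the raw bytes rather than trusting the viewer. The sketch below is plain Python and only tests whether the `%PDF-` marker appears near the start of the content. The helper name `pdf_header_offset` and the 1024-byte window are my own assumptions (Acrobat is known to tolerate some leading bytes before the header), and I don't know how `ai_parse_document` actually detects the format, so a non-zero offset would be a hint at most, not a diagnosis.

```python
def pdf_header_offset(data: bytes, window: int = 1024) -> int:
    """Return the offset of the '%PDF-' marker within the first `window`
    bytes, or -1 if it is not found there.

    Hypothetical helper for illustration only: 0 means the content starts
    with a standard PDF header, a positive value means there are leading
    bytes before the header, and -1 means no header was found in the window.
    """
    return bytes(data[:window]).find(b"%PDF-")


# Small self-contained examples with made-up byte strings
print(pdf_header_offset(b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n"))   # header at the very start
print(pdf_header_offset(b"\xef\xbb\xbf%PDF-1.4\n"))          # three junk bytes first
print(pdf_header_offset(b"PK\x03\x04not a pdf"))             # no PDF header at all
```

In the notebook this could be applied to a single row on the driver, for example to `raw_pdfs.select('content').first()['content']`, before calling `ai_parse_document`.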
Does anyone know what the error message means by "update the `format` from your ai function expression"? I can't see a parameter for that in the `ai_parse_document` documentation.
Alternatively, are there some PDFs that this (beta) function just can't handle yet?
Any advice much appreciated.