Re: ai_parse_document struggling to detect pdf

lucaperes · ‎12-04-2025

I discovered that ai_parse_document only works when the input is passed as real Python bytes.
The binaryFile format in Spark returns the content as an internal binary type (like a memoryview), and ai_parse_document can’t process that directly.
Once a UDF converts the data into actual bytes, the function works correctly.

from pyspark.sql.functions import ai_parse_document
import pyspark.sql.functions as F
from pyspark.sql.types import BinaryType
import base64


def conversor(content):
    # Decode the column value and return it as a plain Python bytes object.
    # (The original BytesIO round trip returned the same bytes unchanged.)
    return base64.b64decode(content)


conversor_udf = F.udf(conversor, BinaryType())


volume_path = '/Volumes/catalog/schema/volumn'  # no trailing slash; one is added in load() below

raw_pdfs = (
    spark.read
    .format('binaryFile')
    .load(f'{volume_path}/*.pdf')
).limit(1)


display(raw_pdfs)

parsed_pdfs = (
    raw_pdfs.withColumn(
        'content_bin', conversor_udf('content')
    )
    .withColumn(
        'content_parsed',
        ai_parse_document('content_bin')
    )
)
display(parsed_pdfs)
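
If it helps to see the "real bytes" part in isolation, here is a minimal sketch that runs outside Spark. It assumes the column value arrives as some bytes-like object such as a bytearray or memoryview; to_bytes is a hypothetical helper name, not part of pyspark, and it leaves out the base64 step used in the UDF above.

```python
def to_bytes(content):
    """Hypothetical helper: return a bytes-like value as an immutable bytes object."""
    if content is None:
        # Keep nulls as nulls so empty rows do not raise inside a UDF.
        return None
    # bytes() copies the data out of a bytearray or memoryview.
    return bytes(content)


# bytearray and memoryview are bytes-like, but neither is an instance of bytes.
sample = bytearray(b"%PDF-1.7 example")
print(type(to_bytes(sample)))
print(type(to_bytes(memoryview(sample))))
```

In principle this could be wrapped with F.udf(to_bytes, BinaryType()) the same way as conversor_udf, but I have only described the base64 version above as working, so treat this as an illustration rather than a tested replacement.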
