12-04-2025 06:00 AM
Hello @JN_Bristol,
I found that ai_parse_document only works when the input is passed as real Python bytes.
The binaryFile format in Spark returns the content column as an internal binary type (something like a memoryview), and ai_parse_document can’t process that directly.
Once I used a UDF to convert the data into actual bytes, the function started working correctly:
from pyspark.sql.functions import ai_parse_document
import pyspark.sql.functions as F
from pyspark.sql.types import BinaryType
import base64
from io import BytesIO

def conversor(content):
    pdf_bytes = base64.b64decode(content)
    pdf_file_like_object = BytesIO(pdf_bytes)
    return pdf_file_like_object.read()

conversor_udf = F.udf(conversor, BinaryType())

volume_path = '/Volumes/catalog/schema/volumn/'

raw_pdfs = (
    spark.read
    .format('binaryFile')
    .load(f'{volume_path}/*.pdf')
).limit(1)
display(raw_pdfs)

parsed_pdfs = (
    raw_pdfs.withColumn(
        'content_bin', conversor_udf('content')
    )
    .withColumn(
        'content_parsed',
        ai_parse_document('content_bin')
    )
)
display(parsed_pdfs)
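If you want to check the "convert to actual bytes" idea locally without Spark, here is a minimal sketch in plain Python. The helper name to_bytes is hypothetical (it is not part of my notebook above), and I have not confirmed that this simpler conversion on its own is enough for ai_parse_document; it only shows how bytes-like values such as bytearray or memoryview become a real bytes object.

```python
def to_bytes(content):
    """Return an immutable bytes copy of a bytes-like value (hypothetical helper)."""
    if content is None:
        # Preserve nulls so missing file content does not raise.
        return None
    return bytes(content)


# Three bytes-like inputs carrying the same payload.
payload = b"%PDF-1.7"
samples = [payload, bytearray(payload), memoryview(payload)]

for sample in samples:
    converted = to_bytes(sample)
    # Show the input type next to the converted type.
    print(type(sample).__name__, "->", type(converted).__name__)
```

In principle a function like this could be wrapped the same way as conversor, with F.udf(to_bytes, BinaryType()), but treat that as an untested assumption rather than a verified fix.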