Re: PDF Parsing in Notebook

Mykola_Melnyk · 11-26-2024

Please take a look at the PDF DataSource for Apache Spark.

This project provides a custom data source for Apache Spark that lets you read PDF files into a Spark DataFrame. There is also a notebook with a usage example.

df = spark.read.format("pdf") \
    .option("imageType", "BINARY") \
    .option("resolution", "200") \
    .option("pagePerPartition", "2") \
    .option("reader", "pdfBox") \
    .load("path to the pdf file(s)")

df.show()
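
If you read PDFs in more than one place in the notebook, a small wrapper can keep the options together. This is only a sketch: `read_pdfs` and `DEFAULT_PDF_OPTIONS` are names made up for illustration and are not part of the project. It also assumes the data source is already installed on the cluster so that format("pdf") resolves.

```python
# Hypothetical helper, not part of the PDF DataSource project.
# The defaults mirror the option values used in the snippet above.
DEFAULT_PDF_OPTIONS = {
    "imageType": "BINARY",
    "resolution": "200",
    "pagePerPartition": "2",
    "reader": "pdfBox",
}


def read_pdfs(spark, path, **overrides):
    """Load the PDF file(s) at `path` through the "pdf" format.

    Keyword arguments override the defaults above, e.g. resolution="300".
    """
    options = {**DEFAULT_PDF_OPTIONS, **overrides}
    return spark.read.format("pdf").options(**options).load(path)
```

Calling read_pdfs(spark, "path to the pdf file(s)") should then behave like the snippet above, and passing a keyword such as resolution="300" changes a single option without repeating the rest.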

I'm working on document processing with Spark.