11-26-2024 02:06 AM
Please take a look at the PDF DataSource for Apache Spark.
This project provides a custom data source for Apache Spark that lets you read PDF files into a Spark DataFrame. The repository also includes a notebook with a usage example.
df = spark.read.format("pdf") \
    .option("imageType", "BINARY") \
    .option("resolution", "200") \
    .option("pagePerPartition", "2") \
    .option("reader", "pdfBox") \
    .load("path to the pdf file(s)")
df.show()
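If you read PDFs in several places, it can help to keep the options in one spot. Below is a minimal sketch of that idea, not part of the project itself: `pdf_options` and `read_pdfs` are hypothetical helper names I made up, the option keys and default values are simply the ones from the snippet above, and it assumes the spark-pdf package is already attached to your cluster so that the `pdf` format resolves.

```python
def pdf_options(image_type="BINARY", resolution=200,
                pages_per_partition=2, reader="pdfBox"):
    """Build the option map from the snippet above (hypothetical helper).

    Values are converted to strings, matching how the original snippet
    passes them to .option().
    """
    if resolution <= 0:
        raise ValueError("resolution must be positive")
    if pages_per_partition <= 0:
        raise ValueError("pages_per_partition must be positive")
    return {
        "imageType": image_type,
        "resolution": str(resolution),
        "pagePerPartition": str(pages_per_partition),
        "reader": reader,
    }


def read_pdfs(spark, path, **overrides):
    """Read PDF file(s) at `path` into a DataFrame (hypothetical helper).

    `spark` is an existing SparkSession; `overrides` replace the defaults
    defined in pdf_options().
    """
    opts = pdf_options(**overrides)
    # DataFrameReader.options(**kwargs) sets several options in one call.
    return spark.read.format("pdf").options(**opts).load(path)
```

Usage would then look like `df = read_pdfs(spark, "path to the pdf file(s)", resolution=300)` followed by `df.show()`. The validation is only a convenience so that a bad value fails before Spark starts the job.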
I'm developing document processing using Spark.
PDF DataSource for Apache Spark lets you read PDF files directly into a DataFrame and run OCR on them: spark-pdf/examples/PdfDataSource.ipynb at main · StabRise/spark-pdf