- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
09-23-2021 01:37 AM
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
10-15-2021 08:31 AM
If you have familiarity with Scala you can use Tika. Tika is a wrapper around PDFBox. In case you want to use it in Databricks I suggest you to go through this blog and Git repo. For python based codes you may want to use PyPDF2 as a pandas UDF in Spark.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
09-23-2021 05:31 AM
I know of Apache Tika. But that is a java lib and I do not know if there are python bindings.
Pypi has a python version though:
https://pypi.org/project/tika/
It might help.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
10-15-2021 08:31 AM
If you have familiarity with Scala you can use Tika. Tika is a wrapper around PDFBox. In case you want to use it in Databricks I suggest you to go through this blog and Git repo. For python based codes you may want to use PyPDF2 as a pandas UDF in Spark.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
11-26-2024 02:06 AM
Please look to the PDF DataSource for Apache Spark.
This project provides a custom data source for the Apache Spark that allows you to read PDF files into the Spark DataFrame. And here notebook with example of usage.
df = spark.read.format("pdf") \
.option("imageType", "BINARY") \
.option("resolution", "200") \
.option("pagePerPartition", "2") \
.option("reader", "pdfBox") \
.load("path to the pdf file(s)")
df.show()
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
02-02-2025 09:17 AM
PDF Data Source works now on Databricks.
Instruction with example: https://stabrise.com/blog/spark-pdf-on-databricks/
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Tuesday
Spark PDF works now with Unity Catalog volumes, started from 0.1.16 version: more details here: https://stabrise.com/blog/spark-pdf-databricks-unity-catalog/

