09-23-2021 01:37 AM
10-15-2021 08:31 AM
If you are familiar with Scala, you can use Tika, which uses PDFBox under the hood to parse PDFs. If you want to use it in Databricks, I suggest going through this blog and Git repo. For Python-based code, you may want to use PyPDF2 inside a pandas UDF in Spark.
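To make the PyPDF2 suggestion a bit more concrete, here is a minimal sketch of how it could look. It assumes PyPDF2 2.x or later (for `PdfReader` and `extract_text`), that PyPDF2, pandas and pyarrow are installed on the cluster, and that the PDFs are loaded as raw bytes with Spark's `binaryFile` source. The helper names (`join_pages`, `pdf_bytes_to_text`, `make_pdf_text_udf`) and the path in the usage comment are made up for illustration.

```python
import io
from typing import Iterable, Optional


def join_pages(page_texts: Iterable[Optional[str]]) -> str:
    """Join per-page text, skipping pages where extraction returned nothing."""
    return "\n".join(t.strip() for t in page_texts if t and t.strip())


def pdf_bytes_to_text(pdf_bytes: bytes) -> Optional[str]:
    """Extract text from one in-memory PDF; return None if it cannot be parsed."""
    # Lazy import so the module loads without PyPDF2; it must exist on the executors.
    from PyPDF2 import PdfReader

    try:
        reader = PdfReader(io.BytesIO(pdf_bytes))
        return join_pages(page.extract_text() for page in reader.pages)
    except Exception:
        # Assumption: a corrupt file should produce NULL instead of failing the job.
        return None


def make_pdf_text_udf():
    """Build a Series-to-Series pandas UDF that wraps pdf_bytes_to_text."""
    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("string")
    def pdf_text(content: pd.Series) -> pd.Series:
        # Each element is the raw bytes of one PDF file.
        return content.apply(pdf_bytes_to_text)

    return pdf_text


# Usage sketch (hypothetical path):
# df = spark.read.format("binaryFile").load("/path/to/pdfs/*.pdf")
# texts = df.select("path", make_pdf_text_udf()("content").alias("text"))
```

Keep in mind that PyPDF2 only reads an embedded text layer, so scanned PDFs without one will likely come back empty and need OCR instead.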
09-23-2021 05:31 AM
I know of Apache Tika, but it is a Java library, and I am not sure whether there are official Python bindings.
PyPI does have a Python package for it, though:
https://pypi.org/project/tika/
It might help.
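In case it is useful, this is roughly how that package is used, as far as I know. Treat it as a sketch: it assumes `pip install tika`, a Java runtime on the machine (the package talks to a Tika server that it starts on first use), and a file on the local filesystem. The function names and the path in the usage comment are hypothetical.

```python
from typing import Any, Dict


def content_or_empty(parsed: Dict[str, Any]) -> str:
    """Return the stripped text from a Tika result dict, or "" if there is none."""
    return (parsed.get("content") or "").strip()


def pdf_to_text_with_tika(path: str) -> str:
    """Parse one local file with tika-python and return its extracted text."""
    # Lazy import; the first call starts or contacts a Tika server.
    from tika import parser

    parsed = parser.from_file(path)  # dict with "metadata" and "content" keys
    return content_or_empty(parsed)


# Usage sketch (hypothetical path):
# text = pdf_to_text_with_tika("/dbfs/tmp/sample.pdf")
```

The `content` value can be `None` when Tika finds no text, which is why the helper falls back to an empty string.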
2 weeks ago
Take a look at the PDF DataSource for Apache Spark.
This project provides a custom data source for Apache Spark that lets you read PDF files into a Spark DataFrame. Here is a notebook with a usage example.
# The PDF DataSource package must be installed on the cluster for format("pdf") to resolve.
df = spark.read.format("pdf") \
    .option("imageType", "BINARY") \
    .option("resolution", "200") \
    .option("pagePerPartition", "2") \
    .option("reader", "pdfBox") \
    .load("path to the pdf file(s)")
df.show()