PDF Parsing in Notebook

Kamal2 — Thu, 23 Sep 2021 08:37:09 GMT

I have PDF files stored in Azure ADLS.

I want to parse the PDF files into PySpark DataFrames.

How can I do that?

Re: PDF Parsing in Notebook

-werners- — Thu, 23 Sep 2021 12:31:42 GMT

I know of Apache Tika, but that is a Java library and I do not know whether there are official Python bindings.

PyPI has a Python port, though:

https://pypi.org/project/tika/

It might help.
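A minimal sketch of what that could look like, assuming the `tika` package from PyPI is installed and a Java runtime is available (as far as I know, the package talks to a Tika server that it starts in the background). The function name `tika_pdf_text` is made up for illustration.

```python
from typing import Optional


def tika_pdf_text(pdf_bytes: bytes) -> Optional[str]:
    """Extract plain text from the raw bytes of one PDF using tika-python."""
    # Imported lazily so the module is resolved wherever the function runs.
    from tika import parser

    # from_buffer returns a dict; the extracted text is under "content".
    parsed = parser.from_buffer(pdf_bytes)
    # "content" can be None when Tika finds no text (e.g. scanned pages).
    return parsed.get("content")
```

The bytes could come from reading the file yourself, or from the `content` column of Spark's `binaryFile` reader.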

Re: PDF Parsing in Notebook

morganmazouchi — Fri, 15 Oct 2021 15:31:23 GMT

If you are familiar with Scala, you can use Tika. For PDFs, Tika is a wrapper around PDFBox. If you want to use it in Databricks, I suggest going through this blog and Git repo. For Python-based code, you may want to use PyPDF2 inside a pandas UDF in Spark.
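A rough sketch of the PyPDF2 route, not a tested production recipe. It assumes PyPDF2 2.x or later (the `PdfReader` / `extract_text` API) is installed on the cluster, and that the files are loaded with Spark's built-in `binaryFile` reader. The names `extract_pdf_text` and `make_pdf_text_udf` and the path placeholder are hypothetical.

```python
import io
from typing import Optional


def extract_pdf_text(content: bytes) -> Optional[str]:
    """Return the text of all pages, or None if the bytes cannot be parsed."""
    # Imported lazily so the import happens on the executors.
    from PyPDF2 import PdfReader

    try:
        reader = PdfReader(io.BytesIO(content))
        # extract_text() may return None for pages without a text layer.
        return "\n".join((page.extract_text() or "") for page in reader.pages)
    except Exception:
        # A corrupt file yields a null instead of failing the whole job.
        return None


def make_pdf_text_udf():
    """Build a scalar pandas UDF mapping a binary column to a string column."""
    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("string")
    def pdf_text(content: pd.Series) -> pd.Series:
        return content.apply(extract_pdf_text)

    return pdf_text


# Usage in a notebook (the path is a placeholder, not a real location):
# files = (spark.read.format("binaryFile")
#          .option("pathGlobFilter", "*.pdf")
#          .load("abfss://<container>@<account>.dfs.core.windows.net/<folder>"))
# parsed = files.select("path", make_pdf_text_udf()("content").alias("text"))
```

Each PDF is read whole into a single row, so very large files can put pressure on executor memory; scanned PDFs without a text layer will come back empty and would need OCR instead.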

Re: PDF Parsing in Notebook

Mykola_Melnyk — Tue, 26 Nov 2024 10:06:46 GMT

Please take a look at the PDF DataSource for Apache Spark.

This project provides a custom data source for Apache Spark that lets you read PDF files into a Spark DataFrame. There is also a notebook with a usage example.

df = spark.read.format("pdf") \
    .option("imageType", "BINARY") \
    .option("resolution", "200") \
    .option("pagePerPartition", "2") \
    .option("reader", "pdfBox") \
    .load("path to the pdf file(s)")

df.show()

Re: PDF Parsing in Notebook

Mykola_Melnyk — Sun, 02 Feb 2025 17:17:29 GMT

PDF Data Source now works on Databricks.
Instructions with an example: https://stabrise.com/blog/spark-pdf-on-databricks/

Re: PDF Parsing in Notebook

Mykola_Melnyk — Tue, 15 Apr 2025 14:22:29 GMT

Spark PDF now works with Unity Catalog volumes, starting from version 0.1.16. More details here: https://stabrise.com/blog/spark-pdf-databricks-unity-catalog/