Data Engineering

PDF Parsing in Notebook

Kamal2
New Contributor II

I have PDF files stored in Azure Data Lake Storage (ADLS).

I want to parse these PDF files into PySpark DataFrames.

How can I do that?

Accepted Solution

morganmazouchi
Databricks Employee

If you are familiar with Scala, you can use Apache Tika, which relies on PDFBox to parse PDFs. To use it in Databricks, I suggest going through this blog and Git repo. For Python-based code, you may want to use PyPDF2 inside a pandas UDF in Spark.
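As a rough illustration of the PyPDF2 plus pandas UDF route, here is a minimal sketch. It assumes Spark 3.x (for the `binaryFile` source and type-hinted pandas UDFs) and PyPDF2 2.x or later installed on the cluster. The helper names `adls_uri`, `extract_text`, and `read_pdfs`, and the container, account, and folder values, are placeholders invented for this example, not part of any library.

```python
import io
from typing import Optional


def adls_uri(container: str, account: str, path: str) -> str:
    """Build an abfss:// URI for ADLS Gen2. All three arguments are placeholders."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


def extract_text(pdf_bytes: bytes) -> Optional[str]:
    """Return the text of all pages joined by newlines, or None if parsing fails."""
    from PyPDF2 import PdfReader  # imported here so it resolves on the executors

    try:
        reader = PdfReader(io.BytesIO(pdf_bytes))
        # extract_text() can return an empty result for image-only pages.
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    except Exception:
        # Assumption: one unreadable file should not fail the whole job.
        return None


def read_pdfs(spark, path: str):
    """Load PDFs as binary rows and add a `text` column via a pandas UDF."""
    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("string")
    def pdf_to_text(content: pd.Series) -> pd.Series:
        return content.apply(extract_text)

    files = (
        spark.read.format("binaryFile")
        .option("pathGlobFilter", "*.pdf")
        .load(path)
    )
    return files.select("path", pdf_to_text("content").alias("text"))
```

In a notebook this might be called as `read_pdfs(spark, adls_uri("docs", "myaccount", "pdfs/"))`, assuming the cluster already has credentials for that storage account. Scanned PDFs without a text layer will come back empty with this approach, since PyPDF2 does not perform OCR.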


5 REPLIES

-werners-
Esteemed Contributor III

I know of Apache Tika, but that is a Java library, and I do not know whether it has official Python bindings.

PyPI does have a Python package for it, though:

https://pypi.org/project/tika/

It might help.
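For reference, a minimal sketch of how the `tika` package from PyPI is typically used. It assumes a Java runtime is available, because the package starts a local Tika server behind the scenes. The helper names `content_of` and `parse_with_tika` and the file path in the usage note are made up for this example.

```python
def content_of(parsed: dict) -> str:
    """Pull the extracted text out of a Tika result dict, tolerating a missing body."""
    return (parsed.get("content") or "").strip()


def parse_with_tika(local_path: str) -> str:
    """Parse one PDF from a local path and return its text."""
    from tika import parser  # pip install tika; requires Java on the machine

    # from_file returns a dict with "metadata" and "content" keys.
    parsed = parser.from_file(local_path)
    return content_of(parsed)
```

A call would look like `parse_with_tika("/tmp/sample.pdf")`, where the path is hypothetical. This runs on a single machine, so spreading it across a cluster would still need something like a UDF around it.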

Mykola_Melnyk
New Contributor III

Please take a look at the PDF DataSource for Apache Spark.

This project provides a custom data source for Apache Spark that lets you read PDF files into a Spark DataFrame. Here is a notebook with an example of its usage.

df = spark.read.format("pdf") \
    .option("imageType", "BINARY") \
    .option("resolution", "200") \
    .option("pagePerPartition", "2") \
    .option("reader", "pdfBox") \
    .load("path to the pdf file(s)")

df.show()

  

I'm developing document processing using Spark.

Mykola_Melnyk
New Contributor III

PDF Data Source now works on Databricks.
Instructions with an example: https://stabrise.com/blog/spark-pdf-on-databricks/

Spark PDF now works with Unity Catalog volumes, starting from version 0.1.16. More details here: https://stabrise.com/blog/spark-pdf-databricks-unity-catalog/
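A small sketch of what reading from a volume might look like, assuming the spark-pdf library (0.1.16 or later) is installed on the cluster. The reader options are copied from the example earlier in this thread. The helper names `volume_path` and `load_pdfs_from_volume`, and the catalog, schema, and volume names, are placeholders.

```python
def volume_path(catalog: str, schema: str, volume: str, relative: str = "") -> str:
    """Build a Unity Catalog volume path: /Volumes/<catalog>/<schema>/<volume>/<relative>."""
    parts = ["/Volumes", catalog, schema, volume, relative.strip("/")]
    return "/".join(parts).rstrip("/")


def load_pdfs_from_volume(spark, path: str):
    """Read PDFs with the "pdf" data source, using the options shown earlier in the thread."""
    return (
        spark.read.format("pdf")
        .option("imageType", "BINARY")
        .option("resolution", "200")
        .option("pagePerPartition", "2")
        .option("reader", "pdfBox")
        .load(path)
    )
```

For example, `load_pdfs_from_volume(spark, volume_path("main", "docs", "pdfs", "invoices"))` would read from `/Volumes/main/docs/pdfs/invoices`, where all four names are hypothetical.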
