Databricks Community

Kamal2 · ‎09-23-2021

I have PDF files stored in Azure ADLS.

I want to parse the PDF files into PySpark DataFrames.

How can I do that?

-werners- · ‎09-23-2021

I know of Apache Tika, but that is a Java library and I do not know if there are official Python bindings.

PyPI has a Python version, though:

https://pypi.org/project/tika/

It might help.
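For reference, a minimal sketch of how the `tika` package from PyPI is typically called. It assumes the package is installed, a Java runtime is available (the package starts a local Tika server on first use), and the PDF is reachable as a local file path; `tika_extract_text` is a hypothetical helper name.

```python
import os


def tika_extract_text(local_path: str) -> str:
    """Return the plain text Tika extracts from a local file (hypothetical helper)."""
    if not os.path.isfile(local_path):
        raise FileNotFoundError(local_path)

    # Imported here so the Tika dependency is only needed when a file is parsed.
    from tika import parser

    # from_file returns a dict with "metadata" and "content" keys.
    parsed = parser.from_file(local_path)
    # "content" can be None when no text could be extracted.
    return (parsed.get("content") or "").strip()
```

This runs on a single machine; to use it across a cluster it would still have to be wrapped in a UDF, and every worker would need Java available.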

morganmazouchi · ‎10-15-2021

If you are familiar with Scala, you can use Tika, which relies on PDFBox for parsing PDFs. If you want to use it in Databricks, I suggest going through this blog and Git repo. For Python-based code, you may want to use PyPDF2 inside a pandas UDF in Spark.
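A minimal sketch of the PyPDF2-in-a-pandas-UDF idea, not a tested recipe. It assumes PyPDF2 2.0 or later (for `PdfReader` and `extract_text`) is installed on the cluster and Spark 3.0 or later (for the `binaryFile` source). `extract_pdf_text` and `parse_pdfs` are hypothetical helper names.

```python
import io
from typing import Optional


def extract_pdf_text(content: Optional[bytes]) -> str:
    """Concatenate the text of all pages; return "" for missing or empty input."""
    if not content:
        return ""
    # Imported here so PyPDF2 is only needed where parsing actually runs.
    from PyPDF2 import PdfReader

    reader = PdfReader(io.BytesIO(content))
    # extract_text() may return None or "" for pages without a text layer.
    return "\n".join((page.extract_text() or "") for page in reader.pages)


def parse_pdfs(spark, path: str):
    """Read PDFs under `path` as binary files and add a `text` column."""
    import pandas as pd
    from pyspark.sql.functions import col, pandas_udf

    @pandas_udf("string")
    def pdf_text_udf(contents: pd.Series) -> pd.Series:
        # Each element is the raw bytes of one PDF file.
        return contents.apply(extract_pdf_text)

    return (
        spark.read.format("binaryFile")
        .option("pathGlobFilter", "*.pdf")  # skip non-PDF files in the folder
        .load(path)
        .select("path", pdf_text_udf(col("content")).alias("text"))
    )
```

Called as `parse_pdfs(spark, "abfss://<container>@<account>.dfs.core.windows.net/<folder>")`, where the path is a placeholder for your own ADLS location, this should give one row per file with `path` and `text` columns. Scanned PDFs without a text layer will come back empty, since PyPDF2 does not do OCR.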

Mykola_Melnyk · 12 hours ago

Please take a look at the PDF DataSource for Apache Spark.

This project provides a custom data source for Apache Spark that lets you read PDF files into a Spark DataFrame. Here is a notebook with an example of usage.

df = spark.read.format("pdf") \
    .option("imageType", "BINARY") \
    .option("resolution", "200") \
    .option("pagePerPartition", "2") \
    .option("reader", "pdfBox") \
    .load("path to the pdf file(s)")

df.show()
