Databricks

Kamal2 · ‎09-23-2021

I have PDF files stored in Azure Data Lake Storage (ADLS).

I want to parse these PDF files into PySpark DataFrames.

How can I do that?

User16752240003 · ‎10-15-2021

If you are familiar with Scala, you can use Apache Tika, which relies on PDFBox under the hood to parse PDFs. To use it in Databricks, I suggest going through this blog and Git repo. For Python-based code, you may want to use PyPDF2 inside a pandas UDF in Spark.
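To make the PyPDF2 suggestion concrete, here is a minimal sketch of one way it could look. It is not taken from the blog or repo mentioned above. It assumes PyPDF2 2.0 or later is installed on the cluster (older releases expose `PdfFileReader` instead of `PdfReader`), that the cluster can already read the storage account, and that the files are small enough to be loaded whole through Spark's `binaryFile` source. The helper names `pdf_bytes_to_text` and `read_pdfs_as_text` are made up for this example.

```python
import io


def pdf_bytes_to_text(content):
    """Return the text of one PDF given as raw bytes ('' if empty or unreadable)."""
    if not content:
        return ""
    # Imported here so the import resolves on the executors that run the UDF.
    # Assumption: PyPDF2 >= 2.0 is installed on the cluster.
    from PyPDF2 import PdfReader

    try:
        reader = PdfReader(io.BytesIO(content))
        return "\n".join((page.extract_text() or "") for page in reader.pages)
    except Exception:
        # Corrupt or encrypted files yield '' instead of failing the whole job.
        return ""


def read_pdfs_as_text(spark, pdf_dir):
    """Load every *.pdf under pdf_dir and return a DataFrame of (path, text)."""
    import pandas as pd
    from pyspark.sql.functions import col, pandas_udf

    @pandas_udf("string")
    def extract_text(content: pd.Series) -> pd.Series:
        # Each element of the series is the raw bytes of one PDF file.
        return content.apply(pdf_bytes_to_text)

    raw = (
        spark.read.format("binaryFile")
        .option("pathGlobFilter", "*.pdf")
        .load(pdf_dir)
    )
    return raw.select(col("path"), extract_text(col("content")).alias("text"))
```

In a notebook this would be called as `df = read_pdfs_as_text(spark, "abfss://<container>@<account>.dfs.core.windows.net/<folder>/")`, with the placeholders replaced by your own container, account, and folder. Scanned PDFs that contain only images will come back with empty text, since PyPDF2 does not perform OCR.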


Kaniz · ‎09-23-2021

Hi @Kamal! My name is Kaniz, and I'm the technical moderator here. Great to meet you, and thanks for your question! Let's first see if your peers in the community have an answer. Otherwise, I will follow up with my team and get back to you soon. Thanks.

-werners- · ‎09-23-2021

I know of Apache Tika, but that is a Java library and I do not know whether there are Python bindings.

PyPI does have a Python package, though:

https://pypi.org/project/tika/

It might help.
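For reference, a rough sketch of how that PyPI package is typically used. As far as I know, it is a client for a Tika server rather than a pure-Python port, so it needs a Java runtime and, by default, fetches and starts the server on first use. The function names below are invented for this example, and the path passed in is assumed to be a local file (for instance, one copied from ADLS to the driver first).

```python
def text_from_tika_result(parsed):
    """Pull the text out of the dict returned by tika's parser ('' if none)."""
    # "content" can be missing or None when Tika finds no extractable text.
    return ((parsed or {}).get("content") or "").strip()


def tika_pdf_to_text(local_path):
    """Extract text from a PDF at a local path using the tika package."""
    # Assumption: `pip install tika` has been run and Java is available.
    from tika import parser

    return text_from_tika_result(parser.from_file(local_path))
```

Calling `tika_pdf_to_text("/tmp/example.pdf")` (a hypothetical path) would return the document text as a single string.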
