Please take a look at the PDF DataSource for Apache Spark. This project provides a custom data source for Apache Spark that lets you read PDF files into a Spark DataFrame. There is also a notebook with a usage example:

df = spark.read.format("pdf") \
...