How to read a PDF file from Azure Datalake blob storage to Databricks

PunithRaj — Thu, 15 Dec 2022 14:24:42 GMT

I need to read a PDF file from Azure Data Lake blob storage into Databricks, where the connection is made through Azure AD access.

Generating SAS tokens is restricted in our environment for security reasons.

The script below can list the names of the PDF files in the folder.

pdf_path = "abfss://<container>@datalakename.dfs.core.windows.net/<folder path>"

pdf_df = spark.read.format("binaryFile").load(pdf_path).cache()

display(pdf_df)

However, after this step I am having difficulty passing the PDF file to the Form Recognizer function.

If anyone has implemented reading PDF files from Azure Data Lake into Databricks, please help me with a script or an approach.

Many thanks in advance!

Best Regards,

Punith Raj
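
For readers with the same problem, here is a minimal sketch of one way the hand-off could look. It relies on the fact that the binaryFile source exposes each file's raw bytes in the `content` column next to `path`. The helper names (`analyze_pdf_rows`, `make_form_recognizer_analyzer`, `analyze_dataframe`) are hypothetical. It also assumes the azure-ai-formrecognizer package (3.2 or later) is installed on the cluster, that the "prebuilt-read" model is suitable, and that you supply your own endpoint and a credential object (for example a token credential from azure-identity, if keys are not allowed). Treat it as a starting point, not a tested solution for this environment.

```python
from typing import Callable, Dict, Iterable, Tuple


def analyze_pdf_rows(
    rows: Iterable[Tuple[str, bytes]],
    analyze: Callable[[bytes], str],
) -> Dict[str, str]:
    """Map each PDF path to the text that `analyze` returns for its bytes."""
    results: Dict[str, str] = {}
    for path, content in rows:
        # PySpark returns BinaryType values as bytearray; normalize to bytes.
        results[path] = analyze(bytes(content))
    return results


def make_form_recognizer_analyzer(endpoint: str, credential) -> Callable[[bytes], str]:
    """Build an analyzer backed by Form Recognizer (assumed setup, see above)."""
    # Imported lazily so the rest of the module works without the SDK.
    from azure.ai.formrecognizer import DocumentAnalysisClient

    client = DocumentAnalysisClient(endpoint=endpoint, credential=credential)

    def analyze(pdf_bytes: bytes) -> str:
        # "prebuilt-read" extracts plain text; another model ID may fit better.
        poller = client.begin_analyze_document("prebuilt-read", document=pdf_bytes)
        return poller.result().content

    return analyze


def analyze_dataframe(pdf_df, analyze: Callable[[bytes], str]) -> Dict[str, str]:
    """Stream (path, content) pairs from a binaryFile DataFrame to `analyze`."""
    rows = (
        (row["path"], row["content"])
        for row in pdf_df.select("path", "content").toLocalIterator()
    )
    return analyze_pdf_rows(rows, analyze)
```

In a notebook this would be called as `analyze_dataframe(pdf_df, make_form_recognizer_analyzer(endpoint, credential))`. The sketch pulls the files to the driver one at a time, which is simple but only reasonable for a modest number of PDFs.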

Re: How to read a PDF file from Azure Datalake blob storage to Databricks

Aviral-Bhardwaj — Tue, 20 Dec 2022 13:59:14 GMT

Hey @Punith raj ,

Not sure about Azure, but in AWS there is a service called Amazon Textract. Please try exploring that.

Re: How to read a PDF file from Azure Datalake blob storage to Databricks

Mykola_Melnyk — Tue, 15 Apr 2025 14:30:50 GMT

@PunithRaj You can try the PDF DataSource for Apache Spark to read PDF files directly into a DataFrame. The output will contain the extracted text and each page rendered as an image. More details here: https://stabrise.com/spark-pdf/

df = spark.read.format("pdf") \
    .option("imageType", "BINARY") \
    .option("resolution", "200") \
    .option("pagePerPartition", "2") \
    .option("reader", "pdfBox") \
    .load("path to the pdf file(s)")