Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to read a PDF file from Azure Datalake blob storage to Databricks

PunithRaj
New Contributor

I have a scenario where I need to read a PDF file from Azure Data Lake blob storage into Databricks, where the connection is made through AD access.

Generating a SAS token is restricted in our environment for security reasons.

The script below can list the names of the PDF files in the folder.

pdf_path = "abfss://<container>@datalakename.dfs.core.windows.net/<folder path>"

pdf_df = spark.read.format("binaryFile").load(pdf_path).cache()

display(pdf_df)

However, after the step above, I am having difficulty passing the PDF file to the Form Recognizer function.

So, if anyone has implemented reading PDF files from Azure Data Lake into Databricks, please help me with a script or an approach.

Many thanks in advance!

Best Regards,

Punith Raj

1 REPLY

Aviral-Bhardwaj
Esteemed Contributor III

Hey @Punith raj,

I'm not sure about Azure, but AWS has a service called Amazon Textract. Please try exploring that once.

AviralBhardwaj
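On the Azure side of the question, one possible way to hand the PDF bytes to Form Recognizer is sketched below. It relies on the fact that the binaryFile reader returns a "content" column holding each file's raw bytes next to its "path". The sketch assumes the azure-ai-formrecognizer (3.2 or later) and azure-identity packages are installed on the cluster, that DefaultAzureCredential can resolve an AD identity with access to the Form Recognizer resource, and that the "prebuilt-read" model is suitable. The helper names (make_client, extract_text, analyze_rows) and the endpoint placeholder are hypothetical and not from the original post.

```python
from typing import Any, Dict, Iterable


def make_client(endpoint: str) -> Any:
    """Build a Form Recognizer client that authenticates through Azure AD.

    The imports are local so the helpers below can be exercised without
    the Azure SDK installed. Assumption: DefaultAzureCredential works in
    your workspace (for example via a service principal).
    """
    from azure.ai.formrecognizer import DocumentAnalysisClient
    from azure.identity import DefaultAzureCredential

    return DocumentAnalysisClient(endpoint=endpoint, credential=DefaultAzureCredential())


def extract_text(client: Any, pdf_bytes: bytes, model_id: str = "prebuilt-read") -> str:
    """Send one PDF (as raw bytes) for analysis and return the extracted text."""
    poller = client.begin_analyze_document(model_id, document=pdf_bytes)
    return poller.result().content


def analyze_rows(client: Any, rows: Iterable[Any]) -> Dict[str, str]:
    """Map each file path to its extracted text.

    Each row needs "path" and "content" fields, as produced by
    spark.read.format("binaryFile"); content arrives as a bytearray.
    """
    return {row["path"]: extract_text(client, bytes(row["content"])) for row in rows}


# Usage on a Databricks cluster (commented out because it needs Spark and Azure):
# rows = pdf_df.select("path", "content").limit(5).collect()
# client = make_client("https://<your-resource>.cognitiveservices.azure.com/")
# texts = analyze_rows(client, rows)
```

Collecting to the driver keeps the sketch simple and is only reasonable for a small number of files; for larger folders the same call could be moved into a UDF or mapPartitions, which is not shown here.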
