How to read a PDF file from Azure Datalake blob storage to Databricks

PunithRaj
New Contributor

I have a scenario where I need to read a PDF file from Azure Data Lake blob storage into Databricks, where the connection is made through Azure AD access.

Generating SAS tokens has been restricted in our environment for security reasons.

The script below can list the PDF files in the folder:

pdf_path = "abfss:datalakename.dfs.core.windows.net/<container folder path>"

pdf_df = spark.read.format("binaryFile").load(pdf_path).cache()

display(pdf_df)

However, after the above step I am finding it difficult to pass the PDF content to the Form Recognizer function.
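Roughly, what I am trying to do is collect the binary content from that DataFrame and hand it to the Form Recognizer client. A rough sketch of the hand-off I have in mind (assuming the azure-ai-formrecognizer package's DocumentAnalysisClient, v3.2+; the endpoint, secret scope and "prebuilt-read" model below are only placeholders, not our actual setup):

from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import DocumentAnalysisClient

# Placeholder endpoint and key -- in practice these would come from a Databricks secret scope
endpoint = "https://<form-recognizer-resource>.cognitiveservices.azure.com/"
key = dbutils.secrets.get(scope="<scope>", key="<key-name>")

client = DocumentAnalysisClient(endpoint=endpoint, credential=AzureKeyCredential(key))

# Pull the binary PDF content read by spark.read.format("binaryFile") to the driver
# (fine for a small batch of files) and send each document to Form Recognizer
for row in pdf_df.select("path", "content").collect():
    poller = client.begin_analyze_document("prebuilt-read", document=bytes(row["content"]))
    result = poller.result()
    print(row["path"], result.content[:200])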

So, if anyone has implemented reading PDF files from Azure Data Lake into Databricks, please help me with the script or the approach.

Many thanks in advance!

Best Regards,

Punith Raj

1 REPLY

Aviral-Bhardwaj
Esteemed Contributor III

Hey @PunithRaj,

I'm not sure about Azure, but AWS has a similar service called Amazon Textract. Please try exploring that once.
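A minimal sketch of what a Textract call could look like (my assumptions: boto3 is available, the document is a single page, and the file path is hypothetical; multi-page PDFs need the asynchronous start_document_text_detection API with the file staged in S3):

import boto3

# Hypothetical local path on the Databricks driver, just for illustration
with open("/dbfs/tmp/sample.pdf", "rb") as f:
    pdf_bytes = f.read()

textract = boto3.client("textract", region_name="us-east-1")

# Synchronous text detection on the raw bytes of a single-page document
response = textract.detect_document_text(Document={"Bytes": pdf_bytes})

# Print each detected line of text
for block in response["Blocks"]:
    if block["BlockType"] == "LINE":
        print(block["Text"])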
