How to read a PDF file from Azure Datalake blob storage to Databricks
12-15-2022 06:24 AM
I have a scenario where I need to read a PDF file from Azure Data Lake (blob storage) into Databricks, with the connection made through AD access.
Generating SAS tokens is restricted in our environment for security reasons.
The script below can list the PDF files in the folder:
pdf_path = "abfss://<container>@datalakename.dfs.core.windows.net/<container folder path>"
pdf_df = spark.read.format("binaryFile").load(pdf_path).cache()
display(pdf_df)
However, after this step I am having difficulty passing the PDF files to the Form Recognizer function.
So, if anyone has implemented reading PDF files from Azure Data Lake into Databricks, please help me with the script or the approach.
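For what it's worth, a minimal sketch of how the `binaryFile` source's `content` column (raw PDF bytes) might be handed to the Form Recognizer SDK. The endpoint, key, container, and folder names below are placeholders I am assuming, not details from this post:

```python
def abfss_uri(container: str, account: str, folder: str) -> str:
    """Build a well-formed ADLS Gen2 URI:
    abfss://<container>@<account>.dfs.core.windows.net/<path>."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{folder.lstrip('/')}"

def analyze_pdfs(pdf_df, endpoint: str, key: str):
    """Feed each PDF's raw bytes (binaryFile's 'content' column) to Form Recognizer,
    so the files never need to be re-read from a local path."""
    # Imported here so the URI helper above works even without the SDK installed.
    from azure.ai.formrecognizer import DocumentAnalysisClient
    from azure.core.credentials import AzureKeyCredential

    client = DocumentAnalysisClient(endpoint=endpoint, credential=AzureKeyCredential(key))
    results = {}
    for row in pdf_df.select("path", "content").toLocalIterator():
        poller = client.begin_analyze_document("prebuilt-document", document=bytes(row["content"]))
        results[row["path"]] = poller.result()
    return results

# Example wiring (placeholders, not real values):
# pdf_df = spark.read.format("binaryFile").load(
#     abfss_uri("<container>", "datalakename", "<container folder path>"))
# analyzed = analyze_pdfs(pdf_df,
#     "https://<resource>.cognitiveservices.azure.com/", "<key>")
```

The key point is that `begin_analyze_document` accepts a bytes stream directly, so the `content` column can be passed as-is without writing the PDFs to DBFS first.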
Many thanks in advance!
Best Regards,
Punith Raj
12-20-2022 05:59 AM
Hey @Punith raj ,
I'm not sure about Azure, but in AWS there is a service called AWS Textract. Please try exploring that.
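For reference, a hedged sketch of what the AWS Textract equivalent might look like (the helper name and client wiring are my assumptions; note that the synchronous Textract APIs only handle single-page documents, while multi-page PDFs go through the asynchronous `Start*` operations against S3):

```python
def textract_extract_lines(textract_client, doc_bytes: bytes) -> list:
    """Return the text of the LINE blocks Textract detects in a document's bytes.

    textract_client is expected to be boto3.client("textract"); the synchronous
    detect_document_text call works on single-page documents passed as bytes.
    """
    resp = textract_client.detect_document_text(Document={"Bytes": doc_bytes})
    return [b["Text"] for b in resp["Blocks"] if b["BlockType"] == "LINE"]

# Example wiring (assumes AWS credentials are configured):
# import boto3
# lines = textract_extract_lines(boto3.client("textract"), pdf_bytes)
```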

