<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How to read a PDF file from Azure Datalake blob storage to Databricks in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-to-read-a-pdf-file-from-azure-datalake-blob-storage-to/m-p/115536#M45108</link>
    <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/24177"&gt;@PunithRaj&lt;/a&gt;&amp;nbsp;You can try the PDF DataSource for Apache Spark to read PDF files directly into a DataFrame. The output contains the extracted text and each page rendered as an image. More details here:&amp;nbsp;&lt;A href="https://stabrise.com/spark-pdf/" target="_self"&gt;https://stabrise.com/spark-pdf/&lt;/A&gt;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;df = spark.read.format("pdf") \
.option("imageType", "BINARY") \
.option("resolution", "200") \
.option("pagePerPartition", "2") \
.option("reader", "pdfBox") \
.load("path to the pdf file(s)")&lt;/LI-CODE&gt;</description>
    <pubDate>Tue, 15 Apr 2025 14:30:50 GMT</pubDate>
    <dc:creator>Mykola_Melnyk</dc:creator>
    <dc:date>2025-04-15T14:30:50Z</dc:date>
    <item>
      <title>How to read a PDF file from Azure Datalake blob storage to Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-read-a-pdf-file-from-azure-datalake-blob-storage-to/m-p/16728#M10854</link>
      <description>&lt;P&gt;I need to read a PDF file from Azure Data Lake blob storage into Databricks, where the connection is made through AD access.&lt;/P&gt;&lt;P&gt;Generating SAS tokens is restricted in our environment for security reasons.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The script below can list the names of the PDF files in the folder:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;pdf_path = "abfss://&amp;lt;container&amp;gt;@datalakename.dfs.core.windows.net/&amp;lt;folder path&amp;gt;"&lt;/P&gt;&lt;P&gt;pdf_df = spark.read.format("binaryFile").load(pdf_path).cache()&lt;/P&gt;&lt;P&gt;display(pdf_df)&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;However, after this step I am having difficulty passing the PDF file to the Form Recognizer function.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;If anyone has implemented reading PDF files from Azure Data Lake in Databricks, please help me with a script or an approach.&lt;/P&gt;&lt;P&gt;Many thanks in advance!&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Best regards,&lt;/P&gt;&lt;P&gt;Punith Raj&lt;/P&gt;</description>
      <pubDate>Thu, 15 Dec 2022 14:24:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-read-a-pdf-file-from-azure-datalake-blob-storage-to/m-p/16728#M10854</guid>
      <dc:creator>PunithRaj</dc:creator>
      <dc:date>2022-12-15T14:24:42Z</dc:date>
    </item>
    <item>
      <title>Re: How to read a PDF file from Azure Datalake blob storage to Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-read-a-pdf-file-from-azure-datalake-blob-storage-to/m-p/16729#M10855</link>
      <description>&lt;P&gt;Hey @Punith raj,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I'm not sure about Azure, but AWS has a service called Amazon Textract. Please try exploring that.&lt;/P&gt;</description>
      <pubDate>Tue, 20 Dec 2022 13:59:14 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-read-a-pdf-file-from-azure-datalake-blob-storage-to/m-p/16729#M10855</guid>
      <dc:creator>Aviral-Bhardwaj</dc:creator>
      <dc:date>2022-12-20T13:59:14Z</dc:date>
    </item>
    <item>
      <title>Re: How to read a PDF file from Azure Datalake blob storage to Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-read-a-pdf-file-from-azure-datalake-blob-storage-to/m-p/115536#M45108</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/24177"&gt;@PunithRaj&lt;/a&gt;&amp;nbsp;You can try the PDF DataSource for Apache Spark to read PDF files directly into a DataFrame. The output contains the extracted text and each page rendered as an image. More details here:&amp;nbsp;&lt;A href="https://stabrise.com/spark-pdf/" target="_self"&gt;https://stabrise.com/spark-pdf/&lt;/A&gt;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;df = spark.read.format("pdf") \
.option("imageType", "BINARY") \
.option("resolution", "200") \
.option("pagePerPartition", "2") \
.option("reader", "pdfBox") \
.load("path to the pdf file(s)")&lt;/LI-CODE&gt;</description>
      <pubDate>Tue, 15 Apr 2025 14:30:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-read-a-pdf-file-from-azure-datalake-blob-storage-to/m-p/115536#M45108</guid>
      <dc:creator>Mykola_Melnyk</dc:creator>
      <dc:date>2025-04-15T14:30:50Z</dc:date>
    </item>
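    <item>
      <title>Re: How to read a PDF file from Azure Datalake blob storage to Databricks</title>
      <description>&lt;P&gt;For the original question about handing the binaryFile output to Form Recognizer, here is a minimal sketch rather than a definitive implementation. It assumes the PDFs are small enough to collect to the driver, and it relies on the binaryFile source exposing path and content columns. The helper names to_pdf_stream and analyze_rows are hypothetical, and analyze stands in for whatever client call you use.&lt;/P&gt;

```python
import io

PDF_MAGIC = b"%PDF-"  # a well-formed PDF file begins with this header


def to_pdf_stream(content: bytes) -> io.BytesIO:
    """Wrap the raw bytes of one PDF in a seekable in-memory stream."""
    if not content.startswith(PDF_MAGIC):
        raise ValueError("content does not look like a PDF")
    return io.BytesIO(content)


def analyze_rows(rows, analyze):
    """Call analyze(stream) for each (path, content) pair; map path to result.

    `analyze` is a hypothetical callable standing in for the real service call.
    """
    return {path: analyze(to_pdf_stream(bytes(content))) for path, content in rows}


# Assumed notebook usage (not executed here). `pdf_df` is the binaryFile
# DataFrame from the question; `client` is assumed to be an already
# authenticated azure.ai.formrecognizer.DocumentAnalysisClient.
#
# rows = [(r.path, r.content) for r in pdf_df.select("path", "content").collect()]
# results = analyze_rows(
#     rows,
#     lambda s: client.begin_analyze_document("prebuilt-document", document=s).result(),
# )
```

&lt;P&gt;Passing the analysis call in as a function keeps the byte handling testable without an Azure connection.&lt;/P&gt;</description>
    </item>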
  </channel>
</rss>

