<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: FileNotFoundError while reading PDF file in Databricks from DBFS location in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/filenotfounderror-while-reading-pdf-file-in-databricks-from-dbfs/m-p/84971#M37226</link>
    <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/117693"&gt;@sahil07&lt;/a&gt;, it seems that with your current setup you can't read from DBFS using vanilla Python. I ran some tests, reproduced the error, and solved it by copying the file from DBFS&amp;nbsp;&lt;SPAN&gt;to the local file system of the driver node, using dbutils.fs.cp with a "file:/" destination.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Try the following:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="test_dbx.jpg" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/10654iAA28D2047704F883/image-size/medium?v=v2&amp;amp;px=400" role="button" title="test_dbx.jpg" alt="test_dbx.jpg" /&gt;&lt;/span&gt;&lt;/P&gt;
</description>
    <pubDate>Tue, 27 Aug 2024 20:36:19 GMT</pubDate>
    <dc:creator>Lucas_TBrabo</dc:creator>
    <dc:date>2024-08-27T20:36:19Z</dc:date>
    <item>
      <title>FileNotFoundError while reading PDF file in Databricks from DBFS location</title>
      <link>https://community.databricks.com/t5/data-engineering/filenotfounderror-while-reading-pdf-file-in-databricks-from-dbfs/m-p/84452#M37196</link>
      <description>&lt;P&gt;&lt;SPAN&gt;I am trying to read a PDF file from a DBFS location in Databricks using PyPDF2.PdfFileReader, but it throws an error saying the file doesn't exist.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="sahil07_0-1724773944269.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/10645i507CE2616980DBEB/image-size/medium/is-moderation-mode/true?v=v2&amp;amp;px=400" role="button" title="sahil07_0-1724773944269.png" alt="sahil07_0-1724773944269.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;But the file does exist at that path, as the screenshot below shows.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="sahil07_1-1724773977934.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/10646iDCA72660088F811A/image-size/medium/is-moderation-mode/true?v=v2&amp;amp;px=400" role="button" title="sahil07_1-1724773977934.png" alt="sahil07_1-1724773977934.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Can anyone please suggest what is wrong here?&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 27 Aug 2024 15:53:19 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/filenotfounderror-while-reading-pdf-file-in-databricks-from-dbfs/m-p/84452#M37196</guid>
      <dc:creator>sahil07</dc:creator>
      <dc:date>2024-08-27T15:53:19Z</dc:date>
    </item>
    <item>
      <title>Re: FileNotFoundError while reading PDF file in Databricks from DBFS location</title>
      <link>https://community.databricks.com/t5/data-engineering/filenotfounderror-while-reading-pdf-file-in-databricks-from-dbfs/m-p/84522#M37201</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/117693"&gt;@sahil07&lt;/a&gt;!&lt;/P&gt;
&lt;P&gt;Because PyPDF2 reads files through Python's standard file I/O rather than the Spark API, you should use&amp;nbsp;&lt;SPAN&gt;"/dbfs/FileStore/sahil_chowdhurry.pdf" instead of&amp;nbsp;"dbfs:/FileStore/sahil_chowdhurry.pdf".&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;As a general rule of thumb: if your reader goes through the Spark API, use "dbfs:/"; otherwise, use "/dbfs/".&lt;/P&gt;
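&lt;P&gt;The rule of thumb above can be sketched as a small helper. This is only an illustration: to_local_dbfs_path is a hypothetical name, not part of Databricks or PyPDF2, and the path it returns is readable by plain Python only if the /dbfs mount is actually available on your cluster.&lt;/P&gt;

```python
def to_local_dbfs_path(path):
    """Convert a 'dbfs:/...' URI into the '/dbfs/...' form used by plain Python file APIs.

    Hypothetical helper for illustration; assumes the /dbfs mount exists on the driver.
    """
    prefix = "dbfs:/"
    if path.startswith(prefix):
        return "/dbfs/" + path[len(prefix):]
    return path  # not a dbfs:/ URI, so leave it unchanged


print(to_local_dbfs_path("dbfs:/FileStore/sahil_chowdhurry.pdf"))
# prints /dbfs/FileStore/sahil_chowdhurry.pdf
```

&lt;P&gt;Spark readers keep the original "dbfs:/" form; only libraries that open files with Python's own file I/O need the converted path.&lt;/P&gt;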
&lt;P&gt;Test it and let me know if it worked &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 27 Aug 2024 17:40:15 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/filenotfounderror-while-reading-pdf-file-in-databricks-from-dbfs/m-p/84522#M37201</guid>
      <dc:creator>Lucas_TBrabo</dc:creator>
      <dc:date>2024-08-27T17:40:15Z</dc:date>
    </item>
    <item>
      <title>Re: FileNotFoundError while reading PDF file in Databricks from DBFS location</title>
      <link>https://community.databricks.com/t5/data-engineering/filenotfounderror-while-reading-pdf-file-in-databricks-from-dbfs/m-p/84577#M37202</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/116514"&gt;@Lucas_TBrabo&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I used the path you suggested, but I get the same error.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="sahil07_0-1724781754505.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/10650i4B2C0EA0218EA3D8/image-size/medium/is-moderation-mode/true?v=v2&amp;amp;px=400" role="button" title="sahil07_0-1724781754505.png" alt="sahil07_0-1724781754505.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 27 Aug 2024 18:03:48 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/filenotfounderror-while-reading-pdf-file-in-databricks-from-dbfs/m-p/84577#M37202</guid>
      <dc:creator>sahil07</dc:creator>
      <dc:date>2024-08-27T18:03:48Z</dc:date>
    </item>
    <item>
      <title>Re: FileNotFoundError while reading PDF file in Databricks from DBFS location</title>
      <link>https://community.databricks.com/t5/data-engineering/filenotfounderror-while-reading-pdf-file-in-databricks-from-dbfs/m-p/84740#M37223</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/117693"&gt;@sahil07&lt;/a&gt;&amp;nbsp;are you running this on a serverless cluster? If not, please let me know the cluster configuration and runtime.&lt;/P&gt;</description>
      <pubDate>Tue, 27 Aug 2024 19:25:59 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/filenotfounderror-while-reading-pdf-file-in-databricks-from-dbfs/m-p/84740#M37223</guid>
      <dc:creator>Lucas_TBrabo</dc:creator>
      <dc:date>2024-08-27T19:25:59Z</dc:date>
    </item>
    <item>
      <title>Re: FileNotFoundError while reading PDF file in Databricks from DBFS location</title>
      <link>https://community.databricks.com/t5/data-engineering/filenotfounderror-while-reading-pdf-file-in-databricks-from-dbfs/m-p/84757#M37224</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/116514"&gt;@Lucas_TBrabo&lt;/a&gt;&amp;nbsp;I am using Databricks Community Edition with DBR 14.3 LTS (Spark 3.5.0, Scala 2.12).&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="sahil07_0-1724787081914.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/10653i06CA563E3AB9DB8F/image-size/medium/is-moderation-mode/true?v=v2&amp;amp;px=400" role="button" title="sahil07_0-1724787081914.png" alt="sahil07_0-1724787081914.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 27 Aug 2024 19:31:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/filenotfounderror-while-reading-pdf-file-in-databricks-from-dbfs/m-p/84757#M37224</guid>
      <dc:creator>sahil07</dc:creator>
      <dc:date>2024-08-27T19:31:34Z</dc:date>
    </item>
    <item>
      <title>Re: FileNotFoundError while reading PDF file in Databricks from DBFS location</title>
      <link>https://community.databricks.com/t5/data-engineering/filenotfounderror-while-reading-pdf-file-in-databricks-from-dbfs/m-p/84971#M37226</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/117693"&gt;@sahil07&lt;/a&gt;, it seems that with your current setup you can't read from DBFS using vanilla Python. I ran some tests, reproduced the error, and solved it by copying the file from DBFS&amp;nbsp;&lt;SPAN&gt;to the local file system of the driver node, using dbutils.fs.cp with a "file:/" destination.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Try the following:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="test_dbx.jpg" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/10654iAA28D2047704F883/image-size/medium?v=v2&amp;amp;px=400" role="button" title="test_dbx.jpg" alt="test_dbx.jpg" /&gt;&lt;/span&gt;&lt;/P&gt;
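&lt;P&gt;The screenshot is not available as text, so here is a minimal sketch of the approach described, not necessarily the exact code in the image. copy_to_driver is a hypothetical helper name and /tmp is an assumed target directory; dbutils is passed in as a parameter, and in a Databricks notebook you would pass the predefined dbutils object.&lt;/P&gt;

```python
import posixpath


def copy_to_driver(dbutils, dbfs_uri, local_dir="/tmp"):
    """Copy a DBFS file to the driver's local disk and return the local path.

    Hypothetical helper: the name and the /tmp default are assumptions made
    for illustration. dbutils is the object Databricks notebooks provide.
    """
    local_path = posixpath.join(local_dir, posixpath.basename(dbfs_uri))
    # The "file:" scheme makes dbutils.fs.cp write to the driver's local file system.
    dbutils.fs.cp(dbfs_uri, "file:" + local_path)
    return local_path


# In a Databricks notebook (assumes PyPDF2 is installed on the cluster):
# local_pdf = copy_to_driver(dbutils, "dbfs:/FileStore/sahil_chowdhurry.pdf")
# from PyPDF2 import PdfReader  # older PyPDF2 releases call this PdfFileReader
# reader = PdfReader(local_pdf)
# print(len(reader.pages))
```

&lt;P&gt;Once the file is on the driver's local disk, any plain Python library can open it by its local path.&lt;/P&gt;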
</description>
      <pubDate>Tue, 27 Aug 2024 20:36:19 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/filenotfounderror-while-reading-pdf-file-in-databricks-from-dbfs/m-p/84971#M37226</guid>
      <dc:creator>Lucas_TBrabo</dc:creator>
      <dc:date>2024-08-27T20:36:19Z</dc:date>
    </item>
    <item>
      <title>Re: FileNotFoundError while reading PDF file in Databricks from DBFS location</title>
      <link>https://community.databricks.com/t5/data-engineering/filenotfounderror-while-reading-pdf-file-in-databricks-from-dbfs/m-p/85162#M37233</link>
      <description>&lt;P&gt;Thanks a lot&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/116514"&gt;@Lucas_TBrabo&lt;/a&gt;, it worked.&lt;/P&gt;&lt;P&gt;But I am wondering: when I read CSV files with spark.read.csv() on the same cluster configuration, it worked without any issues. So is this specific to PDF files, meaning we can't read them directly from DBFS? And if so, what cluster configuration is required to read PDF files directly from DBFS?&lt;/P&gt;</description>
      <pubDate>Wed, 28 Aug 2024 02:23:33 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/filenotfounderror-while-reading-pdf-file-in-databricks-from-dbfs/m-p/85162#M37233</guid>
      <dc:creator>sahil07</dc:creator>
      <dc:date>2024-08-28T02:23:33Z</dc:date>
    </item>
    <item>
      <title>Re: FileNotFoundError while reading PDF file in Databricks from DBFS location</title>
      <link>https://community.databricks.com/t5/data-engineering/filenotfounderror-while-reading-pdf-file-in-databricks-from-dbfs/m-p/85802#M37269</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/117693"&gt;@sahil07&lt;/a&gt;,&amp;nbsp;you could read a CSV file with spark.read.csv() because that call uses the native Spark API to access DBFS, which works just fine. Reading the PDF failed because PyPDF2 does not use the Spark API; it uses Python's standard file reader.&lt;/P&gt;</description>
      <pubDate>Wed, 28 Aug 2024 12:27:51 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/filenotfounderror-while-reading-pdf-file-in-databricks-from-dbfs/m-p/85802#M37269</guid>
      <dc:creator>Lucas_TBrabo</dc:creator>
      <dc:date>2024-08-28T12:27:51Z</dc:date>
    </item>
    <item>
      <title>Re: FileNotFoundError while reading PDF file in Databricks from DBFS location</title>
      <link>https://community.databricks.com/t5/data-engineering/filenotfounderror-while-reading-pdf-file-in-databricks-from-dbfs/m-p/85856#M37277</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/116514"&gt;@Lucas_TBrabo&lt;/a&gt;&amp;nbsp;thanks for the detailed explanation, really appreciate it.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 28 Aug 2024 16:08:05 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/filenotfounderror-while-reading-pdf-file-in-databricks-from-dbfs/m-p/85856#M37277</guid>
      <dc:creator>sahil07</dc:creator>
      <dc:date>2024-08-28T16:08:05Z</dc:date>
    </item>
  </channel>
</rss>

