Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

FileNotFoundError while reading PDF file in Databricks from DBFS location

sahil07
New Contributor III

I am trying to read a PDF file from a DBFS location in Databricks using PyPDF2.PdfFileReader, but it throws an error saying that the file doesn't exist.

sahil07_0-1724773944269.png

But the file exists at that path; see the screenshot below.

sahil07_1-1724773977934.png

Can anyone please suggest what is wrong here?

1 ACCEPTED SOLUTION

Accepted Solutions

Lucas_TBrabo
Databricks Employee

@sahil07, it seems that with your current setup you can't read from DBFS using vanilla Python. I've run some tests, reproduced the error, and solved it by using dbutils.fs.cp to copy the file from DBFS to the local file system of the driver node (a "file:/" destination).

Try the following:

test_dbx.jpg
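A minimal sketch of the copy-then-read approach described above might look like the following. The helper name `copy_dbfs_to_driver` and the `/tmp` destination are assumptions made for illustration and are not taken from the screenshot; `dbutils` is the object that Databricks notebooks predefine, passed in explicitly here so the helper does not rely on a global.

```python
def copy_dbfs_to_driver(dbutils, dbfs_uri, local_path):
    """Copy a DBFS file to the driver's local disk and return the local path.

    `copy_dbfs_to_driver` is a hypothetical helper name; `dbutils` is the
    utility object available in Databricks notebooks.
    """
    # The "file:" scheme tells dbutils that the destination is the driver's
    # local file system rather than DBFS.
    dbutils.fs.cp(dbfs_uri, "file:" + local_path)
    return local_path


# Usage inside a Databricks notebook (the /tmp destination is an assumption):
# local_pdf = copy_dbfs_to_driver(
#     dbutils,
#     "dbfs:/FileStore/sahil_chowdhurry.pdf",
#     "/tmp/sahil_chowdhurry.pdf",
# )
# from PyPDF2 import PdfReader  # older PyPDF2 releases call this PdfFileReader
# reader = PdfReader(local_pdf)
# print(len(reader.pages))
```

Once the file is on the driver's local disk, any plain Python library can open it by its ordinary path.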

 


8 REPLIES

Lucas_TBrabo
Databricks Employee

Hi @sahil07!

Since you are reading with PyPDF2, which does not use the Spark API to read data, you should use "/dbfs/FileStore/sahil_chowdhurry.pdf" instead of "dbfs:/FileStore/sahil_chowdhurry.pdf".

As a general rule of thumb: if your reader goes through the Spark API, use "dbfs:/"; otherwise, use "/dbfs/".
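To make the rule of thumb concrete, here is a small sketch that rewrites a "dbfs:/" URI into the "/dbfs/" form that non-Spark libraries expect. The function name `dbfs_uri_to_local_path` is hypothetical, and the sketch assumes the cluster actually exposes the `/dbfs` mount, which not every setup does.

```python
def dbfs_uri_to_local_path(uri: str) -> str:
    """Rewrite 'dbfs:/some/file' as '/dbfs/some/file' for plain Python readers.

    Hypothetical helper; it only rewrites the string and does not check
    that the /dbfs mount exists on the cluster.
    """
    prefix = "dbfs:/"
    if not uri.startswith(prefix):
        raise ValueError(f"expected a path starting with {prefix!r}, got {uri!r}")
    # Drop the scheme and any extra leading slashes, then prepend the mount point.
    return "/dbfs/" + uri[len(prefix):].lstrip("/")


spark_style = "dbfs:/FileStore/sahil_chowdhurry.pdf"
python_style = dbfs_uri_to_local_path(spark_style)
print(python_style)  # /dbfs/FileStore/sahil_chowdhurry.pdf
```

The "dbfs:/" form would go to calls such as spark.read.csv(), and the "/dbfs/" form to libraries such as PyPDF2 that open files through ordinary Python file I/O.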

Test it and let me know if it worked 🙂

sahil07
New Contributor III

Hi @Lucas_TBrabo 

I used the path you suggested, but I get the same error.

sahil07_0-1724781754505.png

 

Lucas_TBrabo
Databricks Employee

@sahil07, are you running this on a serverless cluster? If not, please let me know the config and runtime.

sahil07
New Contributor III

@Lucas_TBrabo I am using Databricks Community Edition, DBR 14.3 LTS (Spark 3.5.0, Scala 2.12).

sahil07_0-1724787081914.png

 

Lucas_TBrabo
Databricks Employee

@sahil07, it seems that with your current setup you can't read from DBFS using vanilla Python. I've run some tests, reproduced the error, and solved it by using dbutils.fs.cp to copy the file from DBFS to the local file system of the driver node (a "file:/" destination).

Try the following:

test_dbx.jpg

 

sahil07
New Contributor III

Thanks a lot @Lucas_TBrabo, it worked.

But I am wondering: when I read CSV files with spark.read.csv() on the same cluster configuration, it worked without any issues. So is it something specific to PDF files that prevents reading them directly from DBFS? And if so, what kind of cluster configuration is required to read PDF files directly from DBFS?

Lucas_TBrabo
Databricks Employee

@sahil07, you could read a CSV file with spark.read.csv() because that call uses the native Spark API to access DBFS, which works just fine. Reading the PDF failed because PyPDF2 does not use the Spark API; it opens the file with Python's standard file reader.
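As a rough illustration of that difference, the sketch below shows how plain Python file I/O treats a "dbfs:/" string: it is not recognised as a URI but resolved as an ordinary relative path, so the open fails. The helper names `resolve_like_python` and `can_open_with_plain_python` are hypothetical, and the sketch assumes a Linux-style file system such as the one on a driver node.

```python
import os


def resolve_like_python(path: str) -> str:
    """Show the absolute path that plain Python file I/O would try to open."""
    return os.path.abspath(path)


def can_open_with_plain_python(path: str) -> bool:
    """Return True if the built-in open() can read `path`, False otherwise."""
    try:
        with open(path, "rb"):
            return True
    except OSError:  # FileNotFoundError is a subclass of OSError
        return False


uri = "dbfs:/FileStore/sahil_chowdhurry.pdf"
# Resolved relative to the current working directory, not against DBFS.
print(resolve_like_python(uri))
print(can_open_with_plain_python(uri))
```

Libraries like PyPDF2 go through this same built-in file access, which is why they need a path on a file system the driver can see.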

sahil07
New Contributor III

@Lucas_TBrabo thanks for the detailed explanation, really appreciate it. 
