08-27-2024 08:53 AM
I am trying to read a PDF file from a DBFS location in Databricks using PyPDF2.PdfFileReader, but it throws an error saying the file doesn't exist.
However, the file does exist at that path (see the screenshot below).
Can anyone suggest what is wrong here?
08-27-2024 10:40 AM
Hi @sahil07!
Since you are reading with PyPDF2, which does not use the Spark API to read data, you should use "/dbfs/FileStore/sahil_chowdhurry.pdf" instead of "dbfs:/FileStore/sahil_chowdhurry.pdf".
As a general rule of thumb: if you are using a reader that talks to the Spark API, use "dbfs:/"; otherwise, use "/dbfs/".
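To make the rule of thumb concrete, here is a minimal sketch of a helper that rewrites a "dbfs:/" URI into the "/dbfs/" form that plain Python libraries expect. The name to_fuse_path is hypothetical (it is not part of Databricks or PyPDF2), and the resulting path only resolves if the cluster actually exposes the /dbfs mount, which may not be true for every cluster configuration.

```python
def to_fuse_path(path: str) -> str:
    """Rewrite a 'dbfs:/...' URI as a '/dbfs/...' local-style path.

    Hypothetical helper for illustration; paths without the 'dbfs:/'
    prefix are returned unchanged.
    """
    prefix = "dbfs:/"
    if path.startswith(prefix):
        # Drop the URI scheme and re-root the remainder under /dbfs/.
        return "/dbfs/" + path[len(prefix):].lstrip("/")
    return path


print(to_fuse_path("dbfs:/FileStore/sahil_chowdhurry.pdf"))
# prints: /dbfs/FileStore/sahil_chowdhurry.pdf
```

The converted path is what you would pass to PyPDF2 or to open(); the original "dbfs:/" form stays appropriate for spark.read and dbutils.fs calls.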
Test it and let me know if it worked.
08-27-2024 11:03 AM
08-27-2024 12:25 PM
@sahil07, are you running this on a serverless cluster? If not, please let me know the cluster config and runtime.
08-27-2024 12:31 PM
08-27-2024 01:36 PM
@sahil07, it seems that with your current setup you can't read from DBFS using vanilla Python. I ran some tests, managed to reproduce the error, and solved it by using dbutils.fs.cp to copy the file from DBFS to the local file system of the driver node (a "file:/" destination).
Try the following:
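A minimal sketch of the copy-then-read approach described above, not necessarily the exact code from this reply. The helper name copy_to_driver and the /tmp destination are assumptions for illustration; dbutils is the object Databricks predefines in notebooks, passed in explicitly here so the function does not depend on notebook globals.

```python
import posixpath


def copy_to_driver(dbutils, dbfs_uri, local_dir="/tmp"):
    """Copy a DBFS file to the driver's local disk and return the local path.

    Hypothetical helper: `local_dir` defaults to /tmp as an assumption.
    """
    local_path = posixpath.join(local_dir, posixpath.basename(dbfs_uri))
    # The "file:" scheme tells dbutils.fs.cp that the destination is the
    # driver node's local file system rather than DBFS.
    dbutils.fs.cp(dbfs_uri, "file:" + local_path)
    return local_path


# In a Databricks notebook, where `dbutils` is predefined:
# local_path = copy_to_driver(dbutils, "dbfs:/FileStore/sahil_chowdhurry.pdf")
# reader = PyPDF2.PdfReader(local_path)  # PdfFileReader in PyPDF2 releases before 3.0
```

Once the file is on the driver's local disk, PyPDF2 can open it like any other local file.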
08-27-2024 07:23 PM
Thanks a lot @Lucas_TBrabo, it worked.
But I am wondering: when I read CSV files on the same cluster config using spark.read.csv(), it worked without any issues. So is it something specific to PDF files that they can't be read directly from DBFS? And if so, what cluster config is required to read PDF files directly from DBFS?
08-28-2024 05:27 AM
@sahil07, you could read a CSV file with spark.read.csv() because that call goes through the native Spark API to access DBFS, which works just fine. The PDF read failed because PyPDF2 does not use the Spark API; it uses Python's standard file reader, which looks for the path on the local file system.
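To illustrate the difference, here is a small sketch of what Python's standard file API does with a "dbfs:/" path. It runs anywhere, not only on Databricks; local_view is a hypothetical helper name used for illustration.

```python
import os


def local_view(path):
    """Return the absolute local path that Python's standard file API looks up."""
    return os.path.abspath(path)


uri = "dbfs:/FileStore/sahil_chowdhurry.pdf"

# Standard Python I/O does not interpret URI schemes: "dbfs:" is treated as
# an ordinary directory name relative to the current working directory.
print(local_view(uri))
print(os.path.exists(uri))  # False unless a local directory named "dbfs:" exists
```

So the "file doesn't exist" error comes from a lookup on the driver's local disk, not from the PDF format itself.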
08-28-2024 09:08 AM
@Lucas_TBrabo thanks for the detailed explanation, really appreciate it.