08-27-2024 08:53 AM
I am trying to read a PDF file from a DBFS location in Databricks using PyPDF2.PdfFileReader, but it throws an error saying the file doesn't exist.
However, the file does exist at that path; see the screenshot below.
Can anyone suggest what is wrong here?
08-27-2024 10:40 AM
Hi @sahil07!
Since you are reading with PyPDF2, which does not use the Spark API to read data, you should use "/dbfs/FileStore/sahil_chowdhurry.pdf" instead of "dbfs:/FileStore/sahil_chowdhurry.pdf".
As a general rule of thumb: if your reader goes through the Spark API, use "dbfs:/"; otherwise, use "/dbfs/".
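To make the rule of thumb concrete, here is a minimal sketch of the path conversion. The helper name `to_fuse_path` is hypothetical (not part of Databricks or PyPDF2), and it assumes the "/dbfs" mount is actually available on the cluster, which may not hold for every setup.

```python
def to_fuse_path(path: str) -> str:
    """Turn a Spark-style 'dbfs:/...' URI into a '/dbfs/...' local-style path.

    Paths that do not start with 'dbfs:/' are returned unchanged.
    """
    prefix = "dbfs:/"
    if path.startswith(prefix):
        # Drop the scheme and any extra leading slashes, then re-root under /dbfs/.
        return "/dbfs/" + path[len(prefix):].lstrip("/")
    return path


# Spark readers keep the URI form; plain-Python readers get the converted form.
spark_style = "dbfs:/FileStore/sahil_chowdhurry.pdf"
python_style = to_fuse_path(spark_style)
```

The converted string is what you would pass to a non-Spark reader such as PyPDF2, while spark.read.* calls keep the "dbfs:/" form.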
Test it and let me know if it worked 🙂
08-27-2024 11:03 AM
08-27-2024 12:25 PM
@sahil07, are you running this on a serverless cluster? If not, please share the cluster config and runtime version.
08-27-2024 12:31 PM
08-27-2024 01:36 PM
@sahil07, it seems that with your current setup you can't read from DBFS using vanilla Python. I ran some tests, managed to reproduce the error, and solved it by using dbutils.fs.cp to copy the file from DBFS to the local file system of the driver node (a "file:/" path).
Try the following:
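The snippet originally posted here did not survive, so what follows is only a sketch of the approach described above, not the author's exact code. The helper names and the "/tmp/..." target path are assumptions; `dbutils` is the object Databricks provides in notebooks, passed in explicitly so the function is easy to reuse. `PdfReader` is the reader class in recent PyPDF2 releases; older releases expose it as `PdfFileReader`.

```python
def copy_dbfs_to_driver(dbutils, dbfs_uri: str, local_path: str) -> str:
    """Copy a DBFS file to the driver's local disk and return the local path."""
    # The "file:" scheme tells dbutils the target is the driver's local file system.
    dbutils.fs.cp(dbfs_uri, "file:" + local_path)
    return local_path


def count_pdf_pages(local_path: str) -> int:
    """Open a local PDF with PyPDF2 and return its page count."""
    # Imported lazily so the copy helper works even where PyPDF2 is not installed.
    from PyPDF2 import PdfReader  # older PyPDF2 releases: PdfFileReader

    with open(local_path, "rb") as fh:
        return len(PdfReader(fh).pages)


# In a Databricks notebook (target path is an assumed example):
# local = copy_dbfs_to_driver(
#     dbutils,
#     "dbfs:/FileStore/sahil_chowdhurry.pdf",
#     "/tmp/sahil_chowdhurry.pdf",
# )
# print(count_pdf_pages(local))
```

Once the file is on the driver's local disk, PyPDF2 can open it like any other local file.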
08-27-2024 07:23 PM
Thanks a lot @Lucas_TBrabo, it worked.
But I am wondering: when I read CSV files on the same cluster configuration using spark.read.csv(), it worked without any issues. So is this specific to PDF files, meaning we can't read them directly from DBFS? And if so, what cluster configuration is required to read PDF files directly from DBFS?
08-28-2024 05:27 AM
@sahil07, you could read the CSV file with spark.read.csv() because that call goes through Spark's native API to access DBFS, which works just fine. Reading the PDF failed because PyPDF2 does not use the Spark API; it uses Python's standard file reader, which looks for the file on the driver's local file system.
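A small illustration of that difference, using only the standard library: plain Python has no notion of the "dbfs:" scheme, so it treats the string as an ordinary local path and fails to find it. The helper name `can_open_locally` is hypothetical.

```python
def can_open_locally(path: str) -> bool:
    """Return True if Python's built-in open() can read the given path."""
    try:
        with open(path, "rb"):
            return True
    except OSError:
        # Covers FileNotFoundError and other local file system errors.
        return False


# Plain Python sees "dbfs:/..." as a local path, not as a DBFS URI, so this
# only succeeds if such a path literally exists on the local disk.
uri_readable = can_open_locally("dbfs:/FileStore/sahil_chowdhurry.pdf")
```

This is the kind of "file doesn't exist" failure PyPDF2 reports when it is handed a "dbfs:/" URI, even though Spark can read the same location.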
08-28-2024 09:08 AM
@Lucas_TBrabo thanks for the detailed explanation, really appreciate it.