Databricks Community

sahil07 · ‎08-27-2024

I am trying to read a PDF file from a DBFS location in Databricks using PyPDF2.PdfFileReader, but it throws an error saying the file doesn't exist.

But the file does exist at that path; see the screenshot below.

Can anyone please suggest what is wrong in this?

Lucas_TBrabo · ‎08-27-2024

Hi @sahil07!

As you are reading using PyPDF2, which does not use the spark API to read data, you should use "/dbfs/FileStore/sahil_chowdhurry.pdf" instead of "dbfs:/FileStore/sahil_chowdhurry.pdf".

As a general rule of thumb: if you are using readers that talk to the Spark API, use "dbfs:/"; otherwise, use "/dbfs/".
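To make the rule of thumb concrete, here is a minimal sketch of converting between the two path styles. The helper names `to_fuse_path` and `to_spark_uri` are hypothetical (they are not part of any Databricks API), and the "/dbfs/" form only helps if the /dbfs mount is actually available on the cluster.

```python
def to_fuse_path(path: str) -> str:
    """Turn 'dbfs:/a/b' into '/dbfs/a/b' for plain-Python readers.

    Paths that do not start with 'dbfs:/' are returned unchanged.
    """
    prefix = "dbfs:/"
    if path.startswith(prefix):
        return "/dbfs/" + path[len(prefix):].lstrip("/")
    return path


def to_spark_uri(path: str) -> str:
    """Turn '/dbfs/a/b' into 'dbfs:/a/b' for Spark readers.

    Paths that do not start with '/dbfs/' are returned unchanged.
    """
    prefix = "/dbfs/"
    if path.startswith(prefix):
        return "dbfs:/" + path[len(prefix):]
    return path


# Path taken from the thread; Spark-style URI in, local-style path out.
print(to_fuse_path("dbfs:/FileStore/sahil_chowdhurry.pdf"))
```

With a library such as PyPDF2 you would pass the result of `to_fuse_path(...)`; with `spark.read` you would pass the "dbfs:/" form.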

Test it and let me know if it worked 🙂

sahil07 · ‎08-27-2024

Hi @Lucas_TBrabo

Used the one you suggested but same issue

Lucas_TBrabo · ‎08-27-2024

@sahil07 are you running this on a serverless cluster? If not, please let me know the cluster config and runtime.

sahil07 · ‎08-27-2024

@Lucas_TBrabo I am using Databricks Community Edition, DBR 14.3 LTS (Spark 3.5.0, Scala 2.12).

Lucas_TBrabo · ‎08-27-2024

@sahil07, it seems that with your current setup you can't read from DBFS using vanilla Python. I ran some tests, reproduced the error, and solved it by using dbutils.fs.cp to copy the file from DBFS to the local file system of the driver node (a "file:/" destination).

Try the following:
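A minimal sketch of the copy-then-read approach described above (not necessarily the exact snippet from this reply). It assumes a Databricks notebook where `dbutils` is available, a PyPDF2 version that provides `PdfReader`, and "/tmp" as the local target directory. The helper names `copy_to_driver` and `count_pages` are made up for illustration.

```python
import posixpath


def copy_to_driver(dbfs_uri, copy_fn, local_dir="/tmp"):
    """Copy a DBFS file to the driver's local disk and return the local path.

    copy_fn is expected to behave like dbutils.fs.cp(source, destination).
    """
    local_path = posixpath.join(local_dir, posixpath.basename(dbfs_uri))
    # The "file:" prefix tells dbutils to write to the driver's local file system.
    copy_fn(dbfs_uri, "file:" + local_path)
    return local_path


def count_pages(local_path):
    """Open a local PDF with PyPDF2 and return its page count."""
    # Imported lazily so the copy helper can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader

    return len(PdfReader(local_path).pages)


# In a Databricks notebook (assumption: dbutils is defined there):
# local_path = copy_to_driver("dbfs:/FileStore/sahil_chowdhurry.pdf", dbutils.fs.cp)
# print(count_pages(local_path))
```

Once the file is on the driver's local disk, any plain-Python library can open it by its ordinary path.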

sahil07 · ‎08-27-2024

Thanks a lot @Lucas_TBrabo it worked.

But I am wondering: when I read CSV files with spark.read.csv() on the same cluster config, it worked without any issues. So is it something specific to PDF files that we can't read them directly from DBFS? And if so, what kind of cluster config is required to read PDF files directly from DBFS?

Lucas_TBrabo · ‎08-28-2024

@sahil07, you could read a CSV file with spark.read.csv() because that uses the native Spark API to access DBFS, which works just fine. Reading the PDF failed because PyPDF2 does not use the Spark API; it uses Python's standard file I/O instead.
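Following that reasoning, one possible alternative (not something tested in this thread) is to let Spark fetch the raw bytes and hand them to PyPDF2 in memory. This sketch assumes Spark's built-in "binaryFile" data source, a PyPDF2 version that provides `PdfReader`, and a PDF small enough to fit in driver memory. The helper names are hypothetical.

```python
import io


def load_pdf_bytes(spark, dbfs_uri):
    """Read one file through the Spark API and return its content as bytes."""
    row = (
        spark.read.format("binaryFile")  # yields path, modificationTime, length, content
        .load(dbfs_uri)
        .select("content")
        .head()
    )
    if row is None:
        raise FileNotFoundError(dbfs_uri)
    return bytes(row["content"])


def count_pages_from_bytes(pdf_bytes):
    """Parse PDF bytes in memory with PyPDF2 and return the page count."""
    # Imported lazily so load_pdf_bytes works without PyPDF2 installed.
    from PyPDF2 import PdfReader

    return len(PdfReader(io.BytesIO(pdf_bytes)).pages)


# In a Databricks notebook (assumption: spark is the active SparkSession):
# data = load_pdf_bytes(spark, "dbfs:/FileStore/sahil_chowdhurry.pdf")
# print(count_pages_from_bytes(data))
```

Here Spark does the DBFS access and PyPDF2 only ever sees an in-memory stream, so no "/dbfs/" path is needed.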