
Problems with pandas.read_parquet() and path

johnb1
New Contributor III

I am doing the "Data Engineering with Databricks V2" learning path.

I cannot run "DE 4.2 - Providing Options for External Sources", because the first code cell does not run successfully:

%run ../Includes/Classroom-Setup-04.2

Screenshot 1: (image "MicrosoftTeams-image" not shown)

Inside the setup notebook, the code crashes at the following command (see screenshot 2):

df = pd.read_parquet(path = datasource_path.replace("dbfs:/", '/dbfs/'))

The error message is:

FileNotFoundError: [Errno 2] No such file or directory: '/dbfs/mnt/dbacademy-datasets/data-engineering-with-databricks/v02/ecommerce/raw/users-historical'

Screenshot 2: (image "MicrosoftTeams-image (1)" not shown)

There seems to be an issue with the path, even though it actually exists:

Screenshot 3: (image "Capture" not shown)

I played around a little with the path specification, but nothing helped:

Screenshot 4: (image "Capture_2" not shown)
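For anyone trying to narrow this down, here is a minimal sketch of what the failing line does with the path. pandas reads through the local file system, so it needs the '/dbfs/...' form, and that only works if the '/dbfs' mount is actually visible to Python on the cluster. The helper name to_local_dbfs_path is made up for illustration, and the datasource_path value is inferred from the error message above, not copied from the setup notebook. It is an assumption, not something confirmed in this thread, that the local '/dbfs' mount is what is missing on this cluster.

```python
import os


def to_local_dbfs_path(uri: str) -> str:
    """Translate a 'dbfs:/...' URI into the '/dbfs/...' local form (hypothetical helper)."""
    prefix = "dbfs:/"
    if uri.startswith(prefix):
        return "/dbfs/" + uri[len(prefix):]
    return uri  # already a local-style path, leave it unchanged


# Assumed value, reconstructed from the FileNotFoundError message.
datasource_path = (
    "dbfs:/mnt/dbacademy-datasets/data-engineering-with-databricks"
    "/v02/ecommerce/raw/users-historical"
)

local_path = to_local_dbfs_path(datasource_path)
print("local path:", local_path)

# If the first check is False, no '/dbfs/...' path can work with pandas,
# no matter how the rest of the path is spelled.
print("'/dbfs' mount visible:", os.path.exists("/dbfs"))
print("dataset path visible:", os.path.exists(local_path))
```

If '/dbfs' itself is not visible, the path seen in the DBFS browser can exist and pandas will still raise FileNotFoundError, which would match the behaviour described above.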

12 REPLIES

UmaMahesh1
Honored Contributor III

Hi @John B​ 

Can you please try removing the dbfs prefix so that the path starts with /mnt only?

If that does not work, can you please upload that notebook's DBC archive so that I can check the details?

Cheers..

johnb1
New Contributor III

Hi @Uma Maheswara Rao Desula​ 

Removing the dbfs prefix and starting with /mnt only does not help.

(image "Capture_3" not shown)

Br.

UmaMahesh1
Honored Contributor III

Also @John B​ 

Assuming this is an old training course, try the same thing on a community cluster with a DBR version lower than 7. The mount points of some old training courses are disabled in DBR 7+.

Cheers...

UmaMahesh1
Honored Contributor III

@John B​ 

Did your issue get resolved?

If you fixed it some other way than the methods above, please post the fix you used.

Cheers..

johnb1
New Contributor III

@Uma Maheswara Rao Desula​ I solved the issue using SS2's suggestion (see below). After reading the data into a Spark DataFrame, I converted it into a pandas DataFrame using the toPandas() method.
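A minimal sketch of that workaround, for anyone landing here later. The function name read_parquet_as_pandas is made up for illustration. It assumes the notebook's predefined spark session and a 'dbfs:/...' path such as datasource_path from the setup notebook. Note that toPandas() collects every row onto the driver, so it is only suitable for data that fits in driver memory.

```python
def read_parquet_as_pandas(spark, path):
    """Read Parquet through Spark, then convert to pandas (hypothetical helper).

    spark: an active SparkSession (predefined as `spark` in Databricks notebooks)
    path:  a 'dbfs:/...' style path, used as-is (no '/dbfs/' rewrite needed)
    """
    spark_df = spark.read.parquet(path)  # returns a pyspark.sql.DataFrame
    return spark_df.toPandas()           # returns a pandas.DataFrame on the driver


# Usage inside a notebook cell (assumes `spark` and `datasource_path` exist):
# df = read_parquet_as_pandas(spark, datasource_path)
```

Because Spark resolves 'dbfs:/' paths itself, this route does not depend on the local '/dbfs' mount that pd.read_parquet needs.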

johnb1
New Contributor III

Hi!

I can only choose runtimes 7.3, 9.1, ..., 12.0; the minimum is 7.3. I am using Databricks Community Edition.

Br.

SS2
Valued Contributor

Can you try it like this: spark.read.parquet("dbfs:/mnt/.......")

johnb1
New Contributor III

Hi @S S​ 

Reading in the file was successful. However, I got a pyspark.sql.dataframe.DataFrame object. This is not the same as a pandas DataFrame, right?

Br.

Aviral-Bhardwaj
Esteemed Contributor III

Hey @S S​  ,

I understand your issue.

To solve this, import that DBC file. Besides the exercise notebooks, it contains a folder with all the solutions; open the first solution notebook and it should work.

Please upvote if my answer gave you a hint.

Thanks

Aviral Bhardwaj

smkazim
New Contributor II

Hello All,

I am getting exactly the issue mentioned in the first post here. I have tried all the solutions listed:

  1. Changing DBR to 7.3: This gave other errors related to libraries not present in that DBR version.
  2. Using spark.read.parquet: This gives the error "AnalysisException: Unable to infer schema for Parquet. It must be specified manually." I have checked that the Parquet files exist in that location and that they are not empty.
  3. Exploring the solutions folder: It gives the same errors.

Any ideas what else I can try, please?

Thanks.

vijaykumar99535
New Contributor III

I used spark.read.parquet and then converted the result to a pandas DataFrame, and it worked for me.

Upvote if it helped you.

(image "vijaykumar99535_0-1704360883621.png" not shown)

jonathanchcc
New Contributor III

Thanks for sharing, this helped me too 🤖
