Read file from dbfs with pd.read_csv() using databricks-connect

hamzatazib96
New Contributor III

Hello all,

As described in the title, here's my problem:

1. I'm using databricks-connect to send jobs to a Databricks cluster.

2. The "local" environment is an AWS EC2 instance.

3. I want to read a CSV file that is in DBFS (Databricks) with pd.read_csv(). The reason is that the file is too big to do spark.read.csv() and then .toPandas() (it crashes every time).

4. When I run pd.read_csv("/dbfs/FileStore/some_file"), I get a FileNotFoundError because pandas resolves the path against the local environment rather than DBFS. Is there a way to do what I want (e.g., change where pandas looks for files with some option)?

Thanks a lot in advance!
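A possible workaround, assuming the Databricks CLI is installed and configured with credentials on the EC2 instance: copy the file out of DBFS to the local filesystem, where pandas can read it. A minimal sketch:

# Minimal sketch: download the DBFS file locally with the Databricks CLI
# ("databricks fs cp"), then read it with pandas. Assumes the CLI is
# installed and authenticated on this machine.
import subprocess
import pandas as pd

subprocess.run(
    ["databricks", "fs", "cp", "dbfs:/FileStore/some_file", "/tmp/some_file"],
    check=True,
)
df = pd.read_csv("/tmp/some_file")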

1 ACCEPTED SOLUTION


Anonymous
Not applicable

Hi,

After some research, I have found out that the pandas API reads only local files. This means that even if a read_csv command works in the Databricks Notebook environment, it will not work when using databricks-connect (pandas reads locally from within the notebook environment).

A workaround is to use the PySpark spark.read.format('csv') API to read the remote file and append .toPandas() at the end to get a pandas DataFrame.

df_pandas = spark.read.format('csv').options(header='true').load('path/in/the/remote/dbfs/filesystem/').toPandas()
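If the full .toPandas() conversion is too large for driver memory (the original problem), the same workaround can be combined with a Spark-side reduction first. A sketch, with hypothetical column names:

# Sketch: trim columns and rows in Spark before converting to pandas.
# 'col_a' and 'col_b' are hypothetical; the path is the placeholder above.
df_pandas = (
    spark.read.format('csv')
    .options(header='true')
    .load('path/in/the/remote/dbfs/filesystem/')
    .select('col_a', 'col_b')  # keep only the columns you need
    .limit(100000)             # optionally cap the row count
    .toPandas()
)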


28 REPLIES

Kaniz
Community Manager

Hi @hamzatazib96! My name is Kaniz, and I'm a technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers on the forum have an answer first; otherwise I will follow up shortly with a response.

User16763506586
Contributor

Hi,

What happens if you change it to the following?

pd.read_csv("file:/dbfs/FileStore/some_file")

Trying it with pd.read_excel does not help.

venter2021
New Contributor II

I am having a similar issue:

  • I am running databricks-connect from within a docker container
  • I have a .xls file stored in Azure File storage, which is mounted to dbfs
  • I would like to read this Excel file with pd.read_excel("dbfs:/mnt/path/to/file.xls")

Has a solution been found for this?

Kaniz
Community Manager

Hi @venter2021, Did you try this?

pd.read_csv("/dbfs/mnt/path_to_file.csv")
pd.read_excel("/dbfs/mnt/path_to_file.xls")
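One way to see why these paths may still fail under databricks-connect: the /dbfs FUSE mount exists on Databricks cluster nodes but generally not on a local client machine. A quick check, run wherever pandas actually executes:

# If this prints False on the databricks-connect client, pandas cannot see
# the file, and read_csv/read_excel will raise FileNotFoundError.
import os
print(os.path.exists("/dbfs/mnt/path_to_file.csv"))  # path from the suggestion above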

Student185
New Contributor III

I've tried that, and it doesn't work.

Kaniz
Community Manager

Hi @Yuanyue Liu​ , Which DBR version are you using?

Student185
New Contributor III

Hi Fatma,

Thanks for asking.

I've tried 10.1 ML (includes Apache Spark 3.2.0, Scala 2.12) and 9.1 LTS (Scala 2.12, Spark 3.1.2). Neither of them works.

However, it works when I read the file via Spark. I used display(dbutils.fs.ls("dbfs:/FileStore/tables/")) to test it, and my file path (dbfs:/FileStore/tables/POS_CASH_balance.csv) exists. So I don't think the problem is the path or my pandas code. My personal guess is that the free version doesn't support reading CSV files from DBFS via pandas directly.

Here is the change to my code, and the change works:

pd.read_csv('dbfs:/FileStore/tables/POS_CASH_balance.csv') --> spark.read.csv('dbfs:/FileStore/tables/POS_CASH_balance.csv')

Hope my experience could help others.

Cheers

martud
New Contributor II

Databricks Community Edition 10.4 LTS ML (Apache Spark 3.2.1, Scala 2.12) has the same problem with pd.read_csv.

The spark.read statement replaces the original column names with (_c0, _c1, …) unless .option("header", "true") is used.

The following forms should work:

path = 'dbfs:/FileStore/tables/POS_CASH_balance.csv'

df = spark.read \
    .option("header", "true") \
    .csv(path)

df = spark.read \
    .format("csv") \
    .option("header", "true") \
    .load(path)
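Tying this back to the original question, the header-aware read can be combined with the accepted workaround, assuming the result fits in memory on the databricks-connect client:

# Sketch: header-aware Spark read of the same path, then convert to pandas.
pdf = spark.read.option("header", "true").csv(path).toPandas()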

Anonymous
Not applicable

Hi @Kaniz Fatma​ ,

I am having similar issues when using databricks-connect with Azure. I am not able to read data that is already mounted to DBFS (from a Data Lake Gen2). The data is readable within the Azure Databricks Notebook environment but not via databricks-connect.

Kaniz
Community Manager

Hi @Arturo Amador​ , Please mention your DBR version.

Anonymous
Not applicable

Hi,

My DBR:

9.1 LTS (includes Apache Spark 3.1.2, Scala 2.12)

Anonymous
Not applicable

@Kaniz Fatma​ ,

All tests in databricks-connect pass. I am also able to run the examples provided in the documentation (which do not read data from DBFS).

Kaniz
Community Manager

Hi @Arturo Amador​ , Would you like to share the changes you made in order to get the solution?
