Databricks Community

hamzatazib96 · ‎08-18-2021

Hello all,

As described in the title, here's my problem:

1. I'm using databricks-connect in order to send jobs to a databricks cluster

2. The "local" environment is an AWS EC2

3. I want to read a CSV file that is in DBFS (databricks) with

pd.read_csv()

. Reason for that is that it's too big to do

spark.read.csv()

and then

.toPandas()

(crashes everytime).

4. When I run

pd.read_csv("/dbfs/FileStore/some_file")

I get a

FileNotFoundError

because it points to the local S3 buckets rather than to dbfs. Is there a way to do what I want to do (e.g. change where pandas looks for files with some options)?

Thanks a lot in advance!

Anonymous · ‎11-25-2021

Hi,

After some research, I have found out that the pandas API reads only local files. This means that even if a read_csv command works in the Databricks Notebook environment, it will not work when using databricks-connect (pandas reads locally from within the notebook environment).

A work around is to use the pyspark spark.read.format('csv') API to read the remote files and append a ".toPandas()" at the end so that we get a pandas dataframe.

df_pandas = spark.read.format('csv').options(header='true').load('path/in/the/remote/dbfs/filesystem/').toPandas()

View solution in original post

User16763506586 · ‎09-29-2021

Hi,

what happens if you change it to below ?

pd.read_csv("file:/dbfs/FileStore/some_file")

venter2021 · ‎10-28-2021

Trying it with pd.read_excel does not help.

venter2021 · ‎10-28-2021

I am having a similar issue:

I am running databricks-connect from within a docker container
I have a .xls file stored in Azure File storage, which is mounted to dbfs
I would like to read this excel file with

pd.read_excel("dbfs:/mnt/path/to/file.xls")

Has a solution been found for this?

Student185 · ‎11-23-2021

I've tried, which doesn't work.

Student185 · ‎11-24-2021

Hi Fatma,

Thanks for asking.

I've tried 10.1 ML (includes Apache Spark 3.2.0, Scala 2.12) and 9.1 LTS (Scala 2.12, Spark 3.1.2) . Both of them don't work.

However, it works while I read it via spark. And I used display(dbutils.fs.ls("dbfs:/FileStore/tables/")) to test it, my file path(dbfs:/FileStore/tables/POS_CASH_balance.csv) exists. So I don't think it is the problem of the path or my code of pandas. I personally guess that the free version didn't support reading csv/files from dbfs via pandas directly, isn't it?

Here is the change of my code, and the change works

pd.read_csv('dbfs:/FileStore/tables/POS_CASH_balance.csv')-->spark.read.csv('dbfs:/FileStore/tables/POS_CASH_balance.csv)

Hope my experience could help others.

Cheers

martud · ‎01-04-2023

DataBricks community edition 10.4 LTS ML (Apache Spark 3.2.1, Scala 2.12) has the same problem with pd.read_csv.

The spark.read statement replaces the original column names with (_c0, _c1,…), unless .option("header", true") is used.

The following forms should work:

path = 'dbfs:/FileStore/tables/POS_CASH_balance.csv'

spark.read
.option("header", "true")
.csv(path)

spark.read
.format("csv")
.option("header", "true")
.load(file_name)

Anonymous · ‎11-24-2021

Hi @Kaniz Fatma ,

I am having similar issues when using databricks-connect with Azure. I am not able to read data that is already mounted to dbfs (from a datalake gen2). The data is readable within the Azure Databricks Notebook environment but not from databricks-connect.

Anonymous · ‎11-24-2021

Hi,

My DBR:

9.1 LTS (includes Apache Spark 3.1.2, Scala 2.12)

Anonymous · ‎11-24-2021

@Kaniz Fatma ,

All tests in databricks-connect pass. I am also able to run the examples provided in the documentation (which do not read data from dbfs)

Anonymous · ‎11-24-2021

Hi @Kaniz Fatma ,

No, I still haven't found the solution and I can't read from dbfs (not with pandas.read_csv).

I meant to say that the setup tests pass, so the issue is not in the setup)

Anonymous · ‎11-24-2021

Hi @Kaniz Fatma ,

I will try that and report!

Anonymous · ‎11-24-2021

Hi @Kaniz Fatma ,

I can confirm that after downgrading to the DBR 6.4, and passing all the tests in:

databricks-connect test

I am still getting the FileNotFound error when trying to use

pd.read_csv('/dbfs/mnt/datalake_gen2_data/some.csv'')

Anonymous · ‎11-25-2021

Hi,

After some research, I have found out that the pandas API reads only local files. This means that even if a read_csv command works in the Databricks Notebook environment, it will not work when using databricks-connect (pandas reads locally from within the notebook environment).

A work around is to use the pyspark spark.read.format('csv') API to read the remote files and append a ".toPandas()" at the end so that we get a pandas dataframe.

df_pandas = spark.read.format('csv').options(header='true').load('path/in/the/remote/dbfs/filesystem/').toPandas()

Student185 · ‎11-25-2021

Hi Arturooa,

It seems we are holding a similar conclusion. Just a quick question, what do you mean for 'local files'? I've uploaded my files into dbfs, are they not local files after that?

Thanks