08-18-2021 09:11 AM
Hello all,
As described in the title, here's my problem:
1. I'm using databricks-connect in order to send jobs to a databricks cluster
2. The "local" environment is an AWS EC2
3. I want to read a CSV file that is in DBFS (Databricks) with pd.read_csv(). The reason is that the file is too big to do spark.read.csv() and then .toPandas() (it crashes every time).
4. When I run pd.read_csv("/dbfs/FileStore/some_file") I get a FileNotFoundError because it points to the local S3 buckets rather than to DBFS.
Is there a way to do what I want to do (e.g. change where pandas looks for files with some options)?
Thanks a lot in advance!
11-25-2021 12:18 AM
Hi,
After some research, I have found out that the pandas API reads only local files. This means that even if a read_csv command works in the Databricks Notebook environment, it will not work when using databricks-connect (in a notebook, pandas reads locally from within the notebook environment).
A workaround is to use the PySpark spark.read.format('csv') API to read the remote files and append ".toPandas()" at the end so that we get a pandas DataFrame.
df_pandas = spark.read.format('csv').options(header='true').load('path/in/the/remote/dbfs/filesystem/').toPandas()
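Since the original question mentions that .toPandas() on the full file crashes, one option is to shrink the data on the cluster before converting it. Below is a minimal sketch, assuming `spark` is an existing SparkSession (for example the one databricks-connect provides). The function name and the `columns` and `max_rows` parameters are made up for illustration and are not part of any Databricks API.

```python
from typing import Optional, Sequence


def read_dbfs_csv_as_pandas(spark, path: str,
                            columns: Optional[Sequence[str]] = None,
                            max_rows: Optional[int] = None):
    """Read a CSV with Spark (on the cluster) and return a pandas DataFrame.

    `spark` is assumed to be an already-created SparkSession.
    `columns` and `max_rows` are hypothetical knobs that reduce the data
    on the cluster before it is transferred to the client.
    """
    sdf = (
        spark.read.format("csv")
        .option("header", "true")  # use the first row as column names
        .load(path)
    )
    if columns is not None:
        sdf = sdf.select(*columns)   # keep only the columns that are needed
    if max_rows is not None:
        sdf = sdf.limit(max_rows)    # cap the number of rows sent to the client
    return sdf.toPandas()
```

For example, read_dbfs_csv_as_pandas(spark, 'dbfs:/FileStore/some_file', max_rows=100000) would transfer at most 100,000 rows to the client instead of the whole file.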
09-29-2021 04:09 AM
Hi,
What happens if you change it to the following?
pd.read_csv("file:/dbfs/FileStore/some_file")
10-28-2021 02:38 AM
Trying it with pd.read_excel does not help.
10-28-2021 02:38 AM
I am having a similar issue:
pd.read_excel("dbfs:/mnt/path/to/file.xls")
Has a solution been found for this?
11-23-2021 05:42 PM
I've tried that, and it doesn't work.
11-24-2021 11:39 AM
Hi Fatma,
Thanks for asking.
I've tried 10.1 ML (includes Apache Spark 3.2.0, Scala 2.12) and 9.1 LTS (Scala 2.12, Spark 3.1.2). Neither of them works.
However, it works when I read the file via Spark. I used display(dbutils.fs.ls("dbfs:/FileStore/tables/")) to check, and my file path (dbfs:/FileStore/tables/POS_CASH_balance.csv) exists, so I don't think the problem is the path or my pandas code. My guess is that the free version doesn't support reading CSV files from DBFS directly via pandas. Is that right?
Here is the change I made to my code, and it works:
pd.read_csv('dbfs:/FileStore/tables/POS_CASH_balance.csv') --> spark.read.csv('dbfs:/FileStore/tables/POS_CASH_balance.csv')
Hope my experience helps others.
Cheers
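A note on the paths used above: the dbfs:/ scheme is understood by Spark and dbutils, while pandas treats its argument as an ordinary file path on the machine where Python runs. On clusters that expose DBFS through the /dbfs mount, dbfs:/FileStore/... corresponds to /dbfs/FileStore/.... Whether that mount is available depends on the workspace and runtime, and the posts in this thread suggest it may not be on Community Edition. A small sketch of the path conversion (the helper name is hypothetical, not a Databricks function):

```python
def dbfs_uri_to_fuse_path(uri: str) -> str:
    """Convert a 'dbfs:/...' URI into the matching '/dbfs/...' local-style path.

    This is pure string handling; it does not check that the /dbfs mount
    actually exists on the machine running the code.
    """
    prefix = "dbfs:/"
    if not uri.startswith(prefix):
        raise ValueError(f"expected a path starting with {prefix!r}, got {uri!r}")
    # Drop the scheme and any extra leading slashes (e.g. 'dbfs:///a/b').
    relative = uri[len(prefix):].lstrip("/")
    return "/dbfs/" + relative
```

On a cluster where the mount exists, pd.read_csv(dbfs_uri_to_fuse_path('dbfs:/FileStore/tables/POS_CASH_balance.csv')) would then read from /dbfs/FileStore/tables/POS_CASH_balance.csv. With databricks-connect the client machine has no such mount, so the same call still fails there.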
01-04-2023 01:04 PM
Databricks Community Edition 10.4 LTS ML (Apache Spark 3.2.1, Scala 2.12) has the same problem with pd.read_csv.
The spark.read statement replaces the original column names with (_c0, _c1, …) unless .option("header", "true") is used.
The following forms should work:
path = 'dbfs:/FileStore/tables/POS_CASH_balance.csv'
# Form 1: csv() shortcut
df = (spark.read
      .option("header", "true")
      .csv(path))
# Form 2: explicit format() and load()
df = (spark.read
      .format("csv")
      .option("header", "true")
      .load(path))
11-24-2021 06:54 AM
Hi @Kaniz Fatma,
I am having similar issues when using databricks-connect with Azure. I am not able to read data that is already mounted to dbfs (from a datalake gen2). The data is readable within the Azure Databricks Notebook environment but not from databricks-connect.
11-24-2021 06:58 AM
Hi,
My DBR:
9.1 LTS (includes Apache Spark 3.1.2, Scala 2.12)
11-24-2021 07:03 AM
@Kaniz Fatma,
All tests in databricks-connect pass. I am also able to run the examples provided in the documentation (which do not read data from DBFS).
11-24-2021 07:49 AM
Hi @Kaniz Fatma,
No, I still haven't found the solution and I can't read from DBFS (not with pandas.read_csv).
I meant to say that the setup tests pass, so the issue is not in the setup.
11-24-2021 07:58 AM
Hi @Kaniz Fatma,
I will try that and report!
11-24-2021 08:25 AM
Hi @Kaniz Fatma,
I can confirm that after downgrading to DBR 6.4 and passing all the tests in
databricks-connect test
I am still getting the FileNotFoundError when trying to use
pd.read_csv('/dbfs/mnt/datalake_gen2_data/some.csv')
11-25-2021 10:09 AM
Hi Arturooa,
It seems we have reached a similar conclusion. Just a quick question: what do you mean by 'local files'? I've uploaded my files into DBFS, are they not local files after that?
Thanks
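On the 'local files' question: as far as pandas is concerned, "local" means the filesystem of the machine where the Python process itself runs. Files uploaded to DBFS live in the workspace storage; with databricks-connect the Python process runs on the client (the EC2 instance or laptop), so a path such as /dbfs/... is looked up there and not on the cluster. A minimal sketch of that check, using only the standard library (the function name is made up for illustration):

```python
import os


def is_readable_by_local_pandas(path: str) -> bool:
    """Return True if `path` is a regular file on THIS machine.

    pandas opens plain paths through the operating system of the machine
    running the Python process, so this is the same lookup that
    pd.read_csv(path) would depend on.
    """
    return os.path.isfile(path)
```

Running is_readable_by_local_pandas('/dbfs/FileStore/some_file') on the databricks-connect client is a quick way to see whether pandas has any chance of opening the file from there.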