08-18-2021 09:11 AM
Hello all,
As described in the title, here's my problem:
1. I'm using databricks-connect to send jobs to a Databricks cluster.
2. The "local" environment is an AWS EC2 instance.
3. I want to read a CSV file that is in DBFS (Databricks) with pd.read_csv(). The reason is that the file is too big for spark.read.csv() followed by .toPandas() (it crashes every time).
4. When I run pd.read_csv("/dbfs/FileStore/some_file") I get a FileNotFoundError, because pandas looks on the local (EC2) filesystem rather than on DBFS. Is there a way to do what I want (e.g. an option that changes where pandas looks for files)? A minimal sketch of both attempts follows below.
Thanks a lot in advance!
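For reference, a minimal sketch of the two attempts described above (the path is the placeholder from this post; spark is the session provided by databricks-connect):

import pandas as pd

# Attempt 1: read via Spark and convert to pandas (reported to crash for a file this large).
# df = spark.read.csv("dbfs:/FileStore/some_file", header=True).toPandas()

# Attempt 2: read directly with pandas through the DBFS FUSE path.
# Under databricks-connect this raises FileNotFoundError, because pandas
# resolves the path on the client (EC2) machine, not on DBFS.
df = pd.read_csv("/dbfs/FileStore/some_file")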
Accepted Solutions
11-25-2021 12:18 AM
Hi,
After some research, I have found out that the pandas API reads only local files. This means that even if a read_csv call works in the Databricks notebook environment, it will not work when using databricks-connect (in the notebook, pandas reads from the cluster's local filesystem, where DBFS is mounted).
A workaround is to use the PySpark spark.read.format('csv') API to read the remote file and append .toPandas() at the end so that we get a pandas DataFrame.
df_pandas = spark.read.format('csv').options(header='true').load('path/in/the/remote/dbfs/filesystem/').toPandas()
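A slightly fuller sketch of the same workaround under databricks-connect (the DBFS path and the selected columns are illustrative, not from the thread):

from pyspark.sql import SparkSession

# The session is created from the databricks-connect configuration.
spark = SparkSession.builder.getOrCreate()

# Read the CSV remotely on the cluster, then bring the result back as a pandas DataFrame.
# Selecting only the columns you need (or filtering rows) before .toPandas() keeps the
# transferred data small, which helps with the memory problems mentioned in the question.
df_pandas = (
    spark.read.format("csv")
    .option("header", "true")
    .load("dbfs:/FileStore/some_file")   # illustrative path
    .select("col_a", "col_b")            # hypothetical column names
    .toPandas()
)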
09-29-2021 04:09 AM
Hi,
What happens if you change it to the following?
pd.read_csv("file:/dbfs/FileStore/some_file")
10-28-2021 02:38 AM
Trying it with pd.read_excel does not help.
10-28-2021 02:38 AM
I am having a similar issue:
- I am running databricks-connect from within a Docker container
- I have a .xls file stored in Azure File storage, which is mounted to DBFS
- I would like to read this Excel file with pd.read_excel("dbfs:/mnt/path/to/file.xls")
Has a solution been found for this?
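One possible sketch, in the same spirit as the CSV workaround in the accepted answer: read the Excel file on the cluster with Spark and convert it to pandas. This assumes the spark-excel connector (com.crealytics:spark-excel) is installed as a cluster library; option names vary slightly between connector versions:

# Read the .xls on the cluster via the spark-excel connector, then convert to pandas.
df_pandas = (
    spark.read.format("com.crealytics.spark.excel")
    .option("header", "true")   # "useHeader" in older connector versions
    .load("dbfs:/mnt/path/to/file.xls")
    .toPandas()
)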
11-23-2021 05:42 PM
I've tried that, but it doesn't work.
11-24-2021 11:39 AM
Hi Fatma,
Thanks for asking.
I've tried 10.1 ML (includes Apache Spark 3.2.0, Scala 2.12) and 9.1 LTS (includes Apache Spark 3.1.2, Scala 2.12). Neither of them works.
However, it does work when I read the file via Spark. I used display(dbutils.fs.ls("dbfs:/FileStore/tables/")) to check, and my file path (dbfs:/FileStore/tables/POS_CASH_balance.csv) exists, so I don't think the problem is the path or my pandas code. My guess is that the free version does not support reading CSV files from DBFS via pandas directly, is that right?
Here is the change to my code, which works:
pd.read_csv('dbfs:/FileStore/tables/POS_CASH_balance.csv') --> spark.read.csv('dbfs:/FileStore/tables/POS_CASH_balance.csv')
Hope my experience helps others.
Cheers
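For completeness, a runnable version of that change (same file path; header=True keeps the original column names):

# Before: fails under databricks-connect, because pandas resolves the path locally.
# pd.read_csv('dbfs:/FileStore/tables/POS_CASH_balance.csv')

# After: read on the cluster with Spark, then convert to pandas only if the data
# fits in memory on the client.
df = spark.read.csv('dbfs:/FileStore/tables/POS_CASH_balance.csv', header=True)
df_pandas = df.toPandas()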
01-04-2023 01:04 PM
Databricks Community Edition 10.4 LTS ML (Apache Spark 3.2.1, Scala 2.12) has the same problem with pd.read_csv.
The spark.read statement replaces the original column names with (_c0, _c1, …) unless .option("header", "true") is used.
The following forms should work:
path = 'dbfs:/FileStore/tables/POS_CASH_balance.csv'

spark.read \
    .option("header", "true") \
    .csv(path)

spark.read \
    .format("csv") \
    .option("header", "true") \
    .load(path)
11-24-2021 06:54 AM
Hi @Kaniz Fatma ,
I am having similar issues when using databricks-connect with Azure. I am not able to read data that is already mounted to DBFS (from a Data Lake Gen2 storage account). The data is readable within the Azure Databricks notebook environment but not via databricks-connect.
11-24-2021 06:58 AM
Hi,
My DBR:
9.1 LTS (includes Apache Spark 3.1.2, Scala 2.12)
11-24-2021 07:03 AM
@Kaniz Fatma ,
All tests in databricks-connect pass. I am also able to run the examples provided in the documentation (which do not read data from DBFS).
11-24-2021 07:49 AM
Hi @Kaniz Fatma ,
No, I still haven't found a solution and I can't read from DBFS (not with pandas.read_csv).
I meant to say that the setup tests pass, so the issue is not in the setup.
11-24-2021 07:58 AM
Hi @Kaniz Fatma ,
I will try that and report!
11-24-2021 08:25 AM
Hi @Kaniz Fatma ,
I can confirm that after downgrading to DBR 6.4 and passing all the tests in databricks-connect test, I am still getting the FileNotFoundError when trying to use pd.read_csv('/dbfs/mnt/datalake_gen2_data/some.csv').
11-25-2021 10:09 AM
Hi Arturooa,
It seems we have reached a similar conclusion. Just a quick question: what do you mean by 'local files'? I've uploaded my files to DBFS, are they not local files after that?
Thanks
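For what it's worth, 'local' here means the filesystem of the machine running the Python process, not DBFS. A small sketch of the distinction under databricks-connect, using the dbutils shim that databricks-connect provides (path from this thread):

import os
from pyspark.sql import SparkSession
from pyspark.dbutils import DBUtils

spark = SparkSession.builder.getOrCreate()
dbutils = DBUtils(spark)

# Lists files in DBFS; this call runs against the workspace, so files uploaded
# to dbfs:/FileStore/tables/ show up here.
print(dbutils.fs.ls("dbfs:/FileStore/tables/"))

# Lists files on the machine where this script runs (the databricks-connect client).
# This is the filesystem pandas reads from, and the uploaded files are not here.
print(os.listdir("/"))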