08-18-2021 09:11 AM
Hello all,
As described in the title, here's my problem:
1. I'm using databricks-connect in order to send jobs to a databricks cluster
2. The "local" environment is an AWS EC2
3. I want to read a CSV file that is in DBFS (Databricks) with pd.read_csv(). The reason is that it's too big to do spark.read.csv() and then .toPandas() (crashes every time).
4. When I run pd.read_csv("/dbfs/FileStore/some_file") I get a FileNotFoundError, because pandas looks on the local filesystem rather than in DBFS. Is there a way to do what I want to do (e.g. change where pandas looks for files with some options)? A minimal sketch of what I'm running is below.
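Roughly, this is what I'm doing (the file path is the placeholder from above, not my real one):
from pyspark.sql import SparkSession
import pandas as pd

# databricks-connect: getOrCreate() attaches to the configured remote cluster
spark = SparkSession.builder.getOrCreate()

# Reading via Spark works, but converting the whole file crashes:
# spark.read.csv("dbfs:/FileStore/some_file", header=True).toPandas()

# This raises FileNotFoundError, because pandas runs on the EC2 machine
# and looks for /dbfs/FileStore/some_file on its local filesystem:
df = pd.read_csv("/dbfs/FileStore/some_file")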
Thanks a lot in advance!
11-25-2021 12:18 AM
Hi,
After some research, I have found that the pandas API reads only local files. This means that even if a read_csv command works in the Databricks notebook environment, it will not work when using databricks-connect (in the notebook, pandas reads from the cluster's local filesystem, where /dbfs is mounted).
A workaround is to use the PySpark spark.read.format('csv') API to read the remote file and append .toPandas() at the end to get a pandas dataframe.
df_pandas = spark.read.format('csv').options(header='true').load('path/in/the/remote/dbfs/filesystem/').toPandas()
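For example (placeholder path; the header and inferSchema options may need adjusting for your file):
# spark here is the databricks-connect session (SparkSession.builder.getOrCreate())
df_pandas = (spark.read.format('csv')
             .options(header='true', inferSchema='true')
             .load('dbfs:/FileStore/some_file')
             .toPandas())
Note that .toPandas() still pulls every row back to the client, so the machine running databricks-connect needs enough memory to hold the whole dataset.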
09-02-2021 02:45 AM
Hi @hamzatazib96! My name is Kaniz, and I'm a technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers on the forum have an answer to your question first; otherwise I will follow up shortly with a response.
09-29-2021 04:09 AM
Hi,
What happens if you change it to the following?
pd.read_csv("file:/dbfs/FileStore/some_file")
10-28-2021 02:38 AM
Trying it with pd.read_excel does not help.
10-28-2021 02:38 AM
I am having a similar issue:
pd.read_excel("dbfs:/mnt/path/to/file.xls")
Has a solution been found for this?
11-05-2021 06:15 AM
Hi @venter2021, did you try this?
pd.read_csv("/dbfs/mnt/path_to_file.csv")
pd.read_excel("/dbfs/mnt/path_to_file.xls")
11-23-2021 05:42 PM
I've tried that; it doesn't work.
11-24-2021 06:45 AM
Hi @Yuanyue Liu, which DBR version are you using?
11-24-2021 11:39 AM
Hi Fatma,
Thanks for asking.
I've tried 10.1 ML (includes Apache Spark 3.2.0, Scala 2.12) and 9.1 LTS (Scala 2.12, Spark 3.1.2). Neither of them works.
However, it does work when I read the file via Spark. I used display(dbutils.fs.ls("dbfs:/FileStore/tables/")) to test it, and my file path (dbfs:/FileStore/tables/POS_CASH_balance.csv) exists, so I don't think the problem is the path or my pandas code. My guess is that the free edition doesn't support reading csv files from DBFS via pandas directly, does it?
Here is the change to my code, and the change works:
pd.read_csv('dbfs:/FileStore/tables/POS_CASH_balance.csv') --> spark.read.csv('dbfs:/FileStore/tables/POS_CASH_balance.csv')
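Spelled out (same path; header=True keeps the original column names, and .toPandas() only if a pandas frame is still needed):
# Old (fails under databricks-connect): pandas cannot see DBFS from the client
# df = pd.read_csv('dbfs:/FileStore/tables/POS_CASH_balance.csv')

# New (works): Spark reads the file from DBFS on the cluster
df = spark.read.csv('dbfs:/FileStore/tables/POS_CASH_balance.csv', header=True)

# Convert locally only if a pandas DataFrame is required
pdf = df.toPandas()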
Hope my experience could help others.
Cheers
01-04-2023 01:04 PM
Databricks Community Edition 10.4 LTS ML (Apache Spark 3.2.1, Scala 2.12) has the same problem with pd.read_csv.
The spark.read statement replaces the original column names with (_c0, _c1, …) unless .option("header", "true") is used.
The following forms should work:
path = 'dbfs:/FileStore/tables/POS_CASH_balance.csv'

df = spark.read.option("header", "true").csv(path)

df = (spark.read
      .format("csv")
      .option("header", "true")
      .load(path))
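A quick way to confirm the header was picked up, and to get back to pandas if that is still the goal (a sketch; the memory caveat from earlier in the thread applies):
print(df.columns)    # real column names from the CSV header, not _c0, _c1, ...
pdf = df.toPandas()  # only if a local pandas DataFrame is needed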
11-24-2021 06:54 AM
Hi @Kaniz Fatma,
I am having similar issues when using databricks-connect with Azure. I am not able to read data that is already mounted to DBFS (from a Data Lake Gen2). The data is readable within the Azure Databricks notebook environment but not from databricks-connect.
11-24-2021 06:55 AM
Hi @Arturo Amador, please mention your DBR version.
11-24-2021 06:58 AM
Hi,
My DBR:
9.1 LTS (includes Apache Spark 3.1.2, Scala 2.12)
11-24-2021 07:03 AM
@Kaniz Fatma,
All tests in databricks-connect pass. I am also able to run the examples provided in the documentation (which do not read data from DBFS).
11-24-2021 07:44 AM
Hi @Arturo Amador, would you like to share the changes you made in order to get to the solution?