Unable to save Spark Dataframe to driver node's local file system as CSV file

dataslicer
Contributor

I am running Azure Databricks Enterprise, DBR 8.3 ML, on a single node cluster, with a Python notebook.

I have 2 small Spark dataframes that I can load from ADLS Gen2 via the `abfss://` method using credential passthrough, and I can display their full contents without any issues.
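
For context, the load step looks roughly like this (the storage account, container, and paths are placeholders rather than the actual values):

# Read the two source datasets from ADLS Gen2; credential passthrough handles authentication
smallDF1 = spark.read.csv("abfss://container@storageaccount.dfs.core.windows.net/path/to/df1/", header=True, inferSchema=True)
smallDF2 = spark.read.csv("abfss://container@storageaccount.dfs.core.windows.net/path/to/df2/", header=True, inferSchema=True)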

The CSV save operation completes successfully. However, when I examine the CSV output directory, it seems to contain only the commit marker files that point at the data, not the actual data files.

Here is the sequence of code cells from start to finish:

%sh
mkdir /data
 
type(smallDF1)
 
-- OUTPUT --
Out[29]: pyspark.sql.dataframe.DataFrame
 
smallDF1.count()
 
-- OUTPUT --
Out[27]: 264095
 
smallDF2.count()
 
-- OUTPUT --
Out[28]: 66024
  
smallDF1.coalesce(1).write.csv("file:///data/df1", header = 'true')
smallDF2.coalesce(1).write.csv("file:///data/df2", header = 'true')
 
%sh
ls -al /data/df1/
ls -al /data/df2/
  
-- OUTPUT --
total 20
drwxr-xr-x 2 root root 4096 Sep 27 22:41 .
drwxr-xr-x 8 root root 4096 Sep 27 22:41 ..
-rw-r--r-- 1 root root    8 Sep 27 22:41 ._SUCCESS.crc
-rw-r--r-- 1 root root   12 Sep 27 22:41 ._committed_2366694737653163888.crc
-rw-r--r-- 1 root root    0 Sep 27 22:41 _SUCCESS
-rw-r--r-- 1 root root  112 Sep 27 22:41 _committed_2366694737653163888
total 20
drwxr-xr-x 2 root root 4096 Sep 27 22:41 .
drwxr-xr-x 8 root root 4096 Sep 27 22:41 ..
-rw-r--r-- 1 root root    8 Sep 27 22:41 ._SUCCESS.crc
-rw-r--r-- 1 root root   12 Sep 27 22:41 ._committed_114254853464039644.crc
-rw-r--r-- 1 root root    0 Sep 27 22:41 _SUCCESS
-rw-r--r-- 1 root root  111 Sep 27 22:41 _committed_114254853464039644
 
%sh
cat /data/df1/_committed_2366694737653163888
 
-- OUTPUT --
{"added":["part-00000-tid-2366694737653163888-4b4ac3f3-9aa3-40f8-8710-cef6b958e3bc-32-1-c000.csv"],"removed":[]}

What am I missing here to be able to write these small CSV files?

I would like to read in these 2 CSV files using R.

Thank you for any tips / pointers / advice.

1 ACCEPTED SOLUTION

DouglasLinder
New Contributor III

Maybe someone else can answer you, but I thought this was a limitation of Spark: it cannot write outside of DBFS.

I use:

df.toPandas().to_csv("/tmp/foo.csv")

To do this for small files.

For large files, write them to a DBFS path, and then use the shell to copy /dbfs/foo/partXXXX.csv out of DBFS.
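
For example, something along these lines should work (the paths are just illustrative):

# Small dataframe: collect to the driver and write with pandas
# (only safe when the data comfortably fits in driver memory)
smallDF1.toPandas().to_csv("/data/df1.csv", index=False)
 
# Larger dataframe: write to a DBFS path first...
smallDF2.coalesce(1).write.csv("dbfs:/tmp/df2_out", header=True)
 
%sh
# ...then copy the part file out of DBFS via the /dbfs mount
cp /dbfs/tmp/df2_out/part-*.csv /data/df2.csv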


4 REPLIES


dataslicer
Contributor

Thank you for both of these awesome answers!

They work!

dazfuller
Contributor III

There shouldn't be a need to move these outside of DBFS. Ideally you want to write to something like "dbfs:/FileStore/training/df1". Then, if you want to access them from something that does not understand the DBFS file system, just access them using a straight POSIX path like "/dbfs/FileStore/training/df1/partxxxx.csv".
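
For example (the FileStore path here is just illustrative):

# Write to DBFS using the dbfs:/ scheme
smallDF1.coalesce(1).write.csv("dbfs:/FileStore/training/df1", header=True)
 
%sh
# The same directory is then visible through the local POSIX mount
ls -al /dbfs/FileStore/training/df1/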

Dan_Z
Honored Contributor

Modern Spark operates by a design choice to separate storage and compute. So saving a CSV to the driver's local disk doesn't make sense for a few reasons:

  • the worker nodes don't have access to the driver's disk. They would need to send the data over to the driver, which is slow, burdensome, and could cause memory/IO issues.
  • Spark is designed to write to Hadoop-inspired file systems, like DBFS, S3, Azure Blob/Gen2, etc. That way, the workers can write concurrently.

To do it your way, you could just collect the results to the driver, e.g. using yourDF.toPandas(), and then save the pandas data frame to the driver's local disk. Please note that if you take down the cluster you will lose anything on the local disk; the local disk should only be used as a temp location, if at all.
