<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Unable to save Spark Dataframe to driver node's local file system as CSV file in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/unable-to-save-spark-dataframe-to-driver-node-s-local-file/m-p/14282#M8802</link>
    <description>&lt;P&gt;Running Azure Databricks Enterprise DBR 8.3 ML on a single node, with a Python notebook.&lt;/P&gt;&lt;P&gt;I have 2 small Spark dataframes that I am able to source via credential passthrough, reading from ADLSgen2 via the `abfss://` method, and I can display their full contents without any issues.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The CSV save operation completed successfully. However, when I examine the CSV output directory, it seems to contain only the commit-marker files, not the actual dataframe data.&lt;/P&gt;&lt;P&gt;Here is the sequence of code cells from start to finish:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;%sh
mkdir /data

type(smallDF1)

-- OUTPUT --
Out[29]: pyspark.sql.dataframe.DataFrame

smallDF1.count()

-- OUTPUT --
Out[27]: 264095

smallDF2.count()

-- OUTPUT --
Out[28]: 66024

smallDF1.coalesce(1).write.csv("file:///data/df1", header='true')
smallDF2.coalesce(1).write.csv("file:///data/df2", header='true')

%sh
ls -al /data/df1/
ls -al /data/df2/

-- OUTPUT --
total 20
drwxr-xr-x 2 root root 4096 Sep 27 22:41 .
drwxr-xr-x 8 root root 4096 Sep 27 22:41 ..
-rw-r--r-- 1 root root    8 Sep 27 22:41 ._SUCCESS.crc
-rw-r--r-- 1 root root   12 Sep 27 22:41 ._committed_2366694737653163888.crc
-rw-r--r-- 1 root root    0 Sep 27 22:41 _SUCCESS
-rw-r--r-- 1 root root  112 Sep 27 22:41 _committed_2366694737653163888
total 20
drwxr-xr-x 2 root root 4096 Sep 27 22:41 .
drwxr-xr-x 8 root root 4096 Sep 27 22:41 ..
-rw-r--r-- 1 root root    8 Sep 27 22:41 ._SUCCESS.crc
-rw-r--r-- 1 root root   12 Sep 27 22:41 ._committed_114254853464039644.crc
-rw-r--r-- 1 root root    0 Sep 27 22:41 _SUCCESS
-rw-r--r-- 1 root root  111 Sep 27 22:41 _committed_114254853464039644

%sh
cat /data/train/_committed_2366694737653163888

-- OUTPUT --
{"added":["part-00000-tid-2366694737653163888-4b4ac3f3-9aa3-40f8-8710-cef6b958e3bc-32-1-c000.csv"],"removed":[]}&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;What am I missing here to be able to write these small CSV files?&lt;/P&gt;&lt;P&gt;I would like to read in these 2 CSV files using R.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thank you for any tips / pointers / advice.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
    <pubDate>Mon, 27 Sep 2021 23:16:50 GMT</pubDate>
    <dc:creator>dataslicer</dc:creator>
    <dc:date>2021-09-27T23:16:50Z</dc:date>
    <item>
      <title>Unable to save Spark Dataframe to driver node's local file system as CSV file</title>
      <link>https://community.databricks.com/t5/data-engineering/unable-to-save-spark-dataframe-to-driver-node-s-local-file/m-p/14282#M8802</link>
      <description>&lt;P&gt;Running Azure Databricks Enterprise DBR 8.3 ML on a single node, with a Python notebook.&lt;/P&gt;&lt;P&gt;I have 2 small Spark dataframes that I am able to source via credential passthrough, reading from ADLSgen2 via the `abfss://` method, and I can display their full contents without any issues.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The CSV save operation completed successfully. However, when I examine the CSV output directory, it seems to contain only the commit-marker files, not the actual dataframe data.&lt;/P&gt;&lt;P&gt;Here is the sequence of code cells from start to finish:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;%sh
mkdir /data

type(smallDF1)

-- OUTPUT --
Out[29]: pyspark.sql.dataframe.DataFrame

smallDF1.count()

-- OUTPUT --
Out[27]: 264095

smallDF2.count()

-- OUTPUT --
Out[28]: 66024

smallDF1.coalesce(1).write.csv("file:///data/df1", header='true')
smallDF2.coalesce(1).write.csv("file:///data/df2", header='true')

%sh
ls -al /data/df1/
ls -al /data/df2/

-- OUTPUT --
total 20
drwxr-xr-x 2 root root 4096 Sep 27 22:41 .
drwxr-xr-x 8 root root 4096 Sep 27 22:41 ..
-rw-r--r-- 1 root root    8 Sep 27 22:41 ._SUCCESS.crc
-rw-r--r-- 1 root root   12 Sep 27 22:41 ._committed_2366694737653163888.crc
-rw-r--r-- 1 root root    0 Sep 27 22:41 _SUCCESS
-rw-r--r-- 1 root root  112 Sep 27 22:41 _committed_2366694737653163888
total 20
drwxr-xr-x 2 root root 4096 Sep 27 22:41 .
drwxr-xr-x 8 root root 4096 Sep 27 22:41 ..
-rw-r--r-- 1 root root    8 Sep 27 22:41 ._SUCCESS.crc
-rw-r--r-- 1 root root   12 Sep 27 22:41 ._committed_114254853464039644.crc
-rw-r--r-- 1 root root    0 Sep 27 22:41 _SUCCESS
-rw-r--r-- 1 root root  111 Sep 27 22:41 _committed_114254853464039644

%sh
cat /data/train/_committed_2366694737653163888

-- OUTPUT --
{"added":["part-00000-tid-2366694737653163888-4b4ac3f3-9aa3-40f8-8710-cef6b958e3bc-32-1-c000.csv"],"removed":[]}&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;What am I missing here to be able to write these small CSV files?&lt;/P&gt;&lt;P&gt;I would like to read in these 2 CSV files using R.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thank you for any tips / pointers / advice.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 27 Sep 2021 23:16:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/unable-to-save-spark-dataframe-to-driver-node-s-local-file/m-p/14282#M8802</guid>
      <dc:creator>dataslicer</dc:creator>
      <dc:date>2021-09-27T23:16:50Z</dc:date>
    </item>
    <item>
      <title>Re: Unable to save Spark Dataframe to driver node's local file system as CSV file</title>
      <link>https://community.databricks.com/t5/data-engineering/unable-to-save-spark-dataframe-to-driver-node-s-local-file/m-p/14283#M8803</link>
      <description>&lt;P&gt;Maybe someone else can answer you, but I thought this was a limitation of Spark: it cannot write outside of DBFS.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;For small files, I use:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;df.toPandas().to_csv("/tmp/foo.csv")&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;For large files, write to a DBFS path, and then use the shell to copy /dbfs/foo/partXXXX.csv out of DBFS.&lt;/P&gt;</description>
      <pubDate>Tue, 28 Sep 2021 03:41:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/unable-to-save-spark-dataframe-to-driver-node-s-local-file/m-p/14283#M8803</guid>
      <dc:creator>DouglasLinder</dc:creator>
      <dc:date>2021-09-28T03:41:50Z</dc:date>
    </item>
    <item>
      <title>Re: Unable to save Spark Dataframe to driver node's local file system as CSV file</title>
      <link>https://community.databricks.com/t5/data-engineering/unable-to-save-spark-dataframe-to-driver-node-s-local-file/m-p/14284#M8804</link>
      <description>&lt;P&gt;Thank you for both of these awesome great answers!  &lt;/P&gt;&lt;P&gt;They work!&lt;/P&gt;</description>
      <pubDate>Tue, 28 Sep 2021 06:14:00 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/unable-to-save-spark-dataframe-to-driver-node-s-local-file/m-p/14284#M8804</guid>
      <dc:creator>dataslicer</dc:creator>
      <dc:date>2021-09-28T06:14:00Z</dc:date>
    </item>
    <item>
      <title>Re: Unable to save Spark Dataframe to driver node's local file system as CSV file</title>
      <link>https://community.databricks.com/t5/data-engineering/unable-to-save-spark-dataframe-to-driver-node-s-local-file/m-p/14285#M8805</link>
      <description>&lt;P&gt;There shouldn't be a need to move these outside of DBFS. Ideally you want to write to something like "dbfs:/FileStore/training/df1". Then, if you want to access the files from something that does not understand the DBFS file system, just access them using a straight POSIX path like "/dbfs/FileStore/training/df1/partxxxx.csv".&lt;/P&gt;</description>
      <pubDate>Tue, 28 Sep 2021 13:38:38 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/unable-to-save-spark-dataframe-to-driver-node-s-local-file/m-p/14285#M8805</guid>
      <dc:creator>dazfuller</dc:creator>
      <dc:date>2021-09-28T13:38:38Z</dc:date>
    </item>
    <item>
      <title>Re: Unable to save Spark Dataframe to driver node's local file system as CSV file</title>
      <link>https://community.databricks.com/t5/data-engineering/unable-to-save-spark-dataframe-to-driver-node-s-local-file/m-p/14286#M8806</link>
      <description>&lt;P&gt;Modern Spark is designed to separate storage and compute, so saving a CSV to the driver's local disk doesn't make sense for a few reasons:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;The worker nodes don't have access to the driver's disk. They would need to send the data over to the driver, which is slow, burdensome, and could cause memory/IO issues.&lt;/LI&gt;&lt;LI&gt;Spark is designed to write to Hadoop-style file systems, like DBFS, S3, Azure Blob/Gen2, etc. That way, the workers can write concurrently.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;To do it your way, you could collect the results to the driver, e.g. using yourDF.toPandas(), and then save the pandas dataframe to the driver's local disk. Please note that if you take down the cluster you will lose anything on the local disk; the local disk should only be used as a temporary location, if at all.&lt;/P&gt;</description>
      <pubDate>Tue, 12 Oct 2021 20:41:59 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/unable-to-save-spark-dataframe-to-driver-node-s-local-file/m-p/14286#M8806</guid>
      <dc:creator>Dan_Z</dc:creator>
      <dc:date>2021-10-12T20:41:59Z</dc:date>
    </item>
  </channel>
</rss>

