
Unable to save Spark Dataframe to driver node's local file system as CSV file

dataslicer
Contributor

Running Azure Databricks Enterprise DBR 8.3 ML on a single node, with a Python notebook.

I have 2 small Spark dataframes that I am able to source via credential passthrough, reading from ADLS Gen2 via the `abfss://` method, and I can display their full contents without any issues.

The CSV save operation completed successfully. However, when I examine the CSV output directory, it seems to contain only the commit marker files, not the actual dataframe contents.

Here is the sequence of code cells from start to finish:

%sh
mkdir /data
 
type(smallDF1)
 
-- OUTPUT --
Out[29]: pyspark.sql.dataframe.DataFrame
 
smallDF1.count()
 
-- OUTPUT --
Out[27]: 264095
 
smallDF2.count()
 
-- OUTPUT --
Out[28]: 66024
  
smallDF1.coalesce(1).write.csv("file:///data/df1", header = 'true')
smallDF2.coalesce(1).write.csv("file:///data/df2", header = 'true')
 
%sh
ls -al /data/df1/
ls -al /data/df2/
  
-- OUTPUT --
total 20
drwxr-xr-x 2 root root 4096 Sep 27 22:41 .
drwxr-xr-x 8 root root 4096 Sep 27 22:41 ..
-rw-r--r-- 1 root root    8 Sep 27 22:41 ._SUCCESS.crc
-rw-r--r-- 1 root root   12 Sep 27 22:41 ._committed_2366694737653163888.crc
-rw-r--r-- 1 root root    0 Sep 27 22:41 _SUCCESS
-rw-r--r-- 1 root root  112 Sep 27 22:41 _committed_2366694737653163888
total 20
drwxr-xr-x 2 root root 4096 Sep 27 22:41 .
drwxr-xr-x 8 root root 4096 Sep 27 22:41 ..
-rw-r--r-- 1 root root    8 Sep 27 22:41 ._SUCCESS.crc
-rw-r--r-- 1 root root   12 Sep 27 22:41 ._committed_114254853464039644.crc
-rw-r--r-- 1 root root    0 Sep 27 22:41 _SUCCESS
-rw-r--r-- 1 root root  111 Sep 27 22:41 _committed_114254853464039644
 
%sh
cat /data/df1/_committed_2366694737653163888
 
-- OUTPUT --
{"added":["part-00000-tid-2366694737653163888-4b4ac3f3-9aa3-40f8-8710-cef6b958e3bc-32-1-c000.csv"],"removed":[]}

What am I missing here to be able to write these small CSV files?

I would like to read in these 2 CSV files using R.

Thank you for any tips / pointers / advice.


4 REPLIES

DouglasLinder
New Contributor III

Maybe someone else can answer you, but I thought this was a limitation of Spark: it cannot write outside of DBFS.

For small files, I use:

df.toPandas().to_csv("/tmp/foo.csv")

For large files, write them to a DBFS path, and then use the shell to copy /dbfs/foo/partXXXX.csv out of DBFS.
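
To make that concrete, here is a minimal sketch of both approaches against the dataframes from the question (the /tmp and dbfs:/tmp/df2 paths are illustrative, not anything from the original post):

# Small dataframe: collect to the driver and write with pandas.
smallDF1.toPandas().to_csv("/tmp/df1.csv", index=False)

# Larger dataframe: let Spark write it to DBFS, then copy the single
# part file onto the driver's local disk via the /dbfs POSIX mount.
smallDF2.coalesce(1).write.csv("dbfs:/tmp/df2", header=True, mode="overwrite")

import glob, shutil
part_file = glob.glob("/dbfs/tmp/df2/part-*.csv")[0]
shutil.copy(part_file, "/tmp/df2.csv")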

dataslicer
Contributor

Thank you for both of these great answers!

They work!

dazfuller
Contributor III

There shouldn't be a need to move these outside of DBFS. Ideally you want to write to something like "dbfs:/FileStore/training/df1". Then, if you want to access them from something that does not understand the DBFS file system, just access them using a straight POSIX path like "/dbfs/FileStore/training/df1/partxxxx.csv".
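
As a minimal sketch of that pattern (the FileStore path is the one suggested above; the part file name is whatever Spark generates on your cluster):

# Write with Spark to a DBFS path.
smallDF1.coalesce(1).write.csv("dbfs:/FileStore/training/df1", header=True, mode="overwrite")

# The same directory is visible to non-Spark tools through the /dbfs POSIX mount.
import glob
import pandas as pd

part_file = glob.glob("/dbfs/FileStore/training/df1/part-*.csv")[0]
df1_back = pd.read_csv(part_file)

The same /dbfs/... path should also be readable from R running on the driver, e.g. with read.csv, which is what the original question was after.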

Dan_Z
Databricks Employee

Modern Spark operates by a design choice to separate storage and compute, so saving a CSV to the driver's local disk doesn't make sense for a few reasons:

  • the worker nodes don't have access to the driver's disk. They would need to send the data over to the driver, which is slow, burdensome, and could cause memory/IO issues.
  • Spark is designed to write to Hadoop-inspired file systems, like DBFS, S3, Azure Blob/Gen2, etc. That way, the workers can write concurrently.

To do it your way, you could collect the results to the driver, e.g. using yourDF.toPandas(), and then save the pandas DataFrame out to the driver's local disk. Please note that if you take down the cluster you will lose anything on that local disk; it should only be used as a tmp location, if at all.
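
A small sketch of that, with the caveat above in mind (paths are illustrative; dbutils is the utility object available in Databricks notebooks):

# Collect the small result to the driver and write to its local disk.
# This file lives only as long as the cluster does.
smallDF1.toPandas().to_csv("/tmp/df1.csv", index=False)

# If the file needs to outlive the cluster, copy it into DBFS afterwards.
dbutils.fs.cp("file:/tmp/df1.csv", "dbfs:/FileStore/df1.csv")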
