09-27-2021 04:16 PM
Running Azure Databricks Enterprise DBR 8.3 ML running on a single node, with Python notebook.
I have 2 small Spark DataFrames that I am able to source via credential passthrough, reading from ADLS Gen2 via the `abfss://` method, and I can display their full contents without any issues.
The CSV save operation completed successfully. However, when I examine the CSV output directory, it seems to contain only the commit marker files, not the actual data.
Here is the sequence of code cells from start to finish:
%sh
mkdir /data
type(smallDF1)
-- OUTPUT --
Out[29]: pyspark.sql.dataframe.DataFrame
smallDF1.count()
-- OUTPUT --
Out[27]: 264095
smallDF2.count()
-- OUTPUT --
Out[28]: 66024
smallDF1.coalesce(1).write.csv("file:///data/df1", header=True)
smallDF2.coalesce(1).write.csv("file:///data/df2", header=True)
%sh
ls -al /data/df1/
ls -al /data/df2/
-- OUTPUT --
total 20
drwxr-xr-x 2 root root 4096 Sep 27 22:41 .
drwxr-xr-x 8 root root 4096 Sep 27 22:41 ..
-rw-r--r-- 1 root root 8 Sep 27 22:41 ._SUCCESS.crc
-rw-r--r-- 1 root root 12 Sep 27 22:41 ._committed_2366694737653163888.crc
-rw-r--r-- 1 root root 0 Sep 27 22:41 _SUCCESS
-rw-r--r-- 1 root root 112 Sep 27 22:41 _committed_2366694737653163888
total 20
drwxr-xr-x 2 root root 4096 Sep 27 22:41 .
drwxr-xr-x 8 root root 4096 Sep 27 22:41 ..
-rw-r--r-- 1 root root 8 Sep 27 22:41 ._SUCCESS.crc
-rw-r--r-- 1 root root 12 Sep 27 22:41 ._committed_114254853464039644.crc
-rw-r--r-- 1 root root 0 Sep 27 22:41 _SUCCESS
-rw-r--r-- 1 root root 111 Sep 27 22:41 _committed_114254853464039644
%sh
cat /data/df1/_committed_2366694737653163888
-- OUTPUT --
{"added":["part-00000-tid-2366694737653163888-4b4ac3f3-9aa3-40f8-8710-cef6b958e3bc-32-1-c000.csv"],"removed":[]}
What am I missing here to be able to write these small CSV files?
I would like to read in these 2 CSV files using R.
Thank you for any tips / pointers / advice.
09-27-2021 08:41 PM
Maybe someone else can answer you, but I thought this was a limitation of Spark; it cannot write outside of DBFS.
I use:
df.toPandas().to_csv("/tmp/foo.csv")
to do this for small files.
For large files, write to a DBFS path and then use the shell to copy /dbfs/foo/partXXXX.csv out of DBFS.
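For reference, a minimal sketch of both approaches; the DataFrame name (smallDF1) and the paths (/tmp/df1.csv, dbfs:/tmp/df1_parts) are placeholders, not from the original post:
import glob
import shutil
# Small DataFrame: collect to the driver and write a single CSV with pandas.
smallDF1.toPandas().to_csv("/tmp/df1.csv", index=False)
# Larger DataFrame: let Spark write its part files to a DBFS path first...
smallDF1.write.csv("dbfs:/tmp/df1_parts", header=True, mode="overwrite")
# ...then copy the part file(s) out through the /dbfs FUSE mount.
for part in glob.glob("/dbfs/tmp/df1_parts/part-*.csv"):
    shutil.copy(part, "/tmp/")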
09-27-2021 11:14 PM
Thank you for both of these great answers!
They work!
09-28-2021 06:38 AM
There shouldn't be a need to move these outside of DBFS. Ideally you want to write to something like "dbfs:/FileStore/training/df1". Then, if you want to access the files from something that does not understand the DBFS file system, just use a straight POSIX path like "/dbfs/FileStore/training/df1/partxxxx.csv".
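A quick sketch of that pattern; the DataFrame name and paths are just examples, and the resulting /dbfs/... path is what you would hand to R's read.csv:
import glob
# Write to a DBFS path; Spark produces the part file(s) plus commit markers.
smallDF1.coalesce(1).write.csv("dbfs:/FileStore/training/df1", header=True, mode="overwrite")
# The same directory is visible as a plain POSIX path through the /dbfs mount,
# so anything that doesn't speak dbfs:// (such as R) can read the part file directly.
part_file = glob.glob("/dbfs/FileStore/training/df1/part-*.csv")[0]
print(part_file)  # e.g. /dbfs/FileStore/training/df1/part-00000-...-c000.csv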
10-12-2021 01:41 PM
Modern Spark is designed to separate storage and compute, so saving a CSV to the driver's local disk doesn't really fit that model.
To do it your way, you could just collect the results to the driver, e.g. using yourDF.toPandas(), and then save the pandas DataFrame to the driver's local disk. Please note, if you take down the cluster you will lose anything on the local disk. Local disk should only be used as a tmp location, if at all.
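If you do go the toPandas() route, one way to keep the result beyond the life of the cluster (the names and paths here are illustrative) is to copy it from the driver's local disk into DBFS afterwards:
# Collect to the driver and write a single CSV to local (ephemeral) disk.
yourDF.toPandas().to_csv("/tmp/result.csv", index=False)
# Local disk disappears with the cluster, so copy the file into DBFS
# if it needs to outlive the session. Note the file:/ prefix for local paths.
dbutils.fs.cp("file:/tmp/result.csv", "dbfs:/FileStore/training/result.csv")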