<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Unable to save Spark Dataframe to driver node's local file system as CSV file in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/unable-to-save-spark-dataframe-to-driver-node-s-local-file/m-p/14282#M8802</link>
    <description>&lt;P&gt;Running Azure Databricks Enterprise DBR 8.3 ML on a single node, with a Python notebook.&lt;/P&gt;&lt;P&gt;I have 2 small Spark dataframes that I am able to source via credential passthrough, reading from ADLSgen2 via the `abfss://` method, and I can display their full contents without any issues.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The CSV save operation completed successfully. However, when I examine the CSV output directory, it seems to contain only the commit-marker files, not the actual dataframe data.&lt;/P&gt;&lt;P&gt;Here is the sequence of code cells from start to finish:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;%sh
mkdir /data

type(smallDF1)

-- OUTPUT --
Out[29]: pyspark.sql.dataframe.DataFrame

smallDF1.count()

-- OUTPUT --
Out[27]: 264095

smallDF2.count()

-- OUTPUT --
Out[28]: 66024

smallDF1.coalesce(1).write.csv("file:///data/df1", header='true')
smallDF2.coalesce(1).write.csv("file:///data/df2", header='true')

%sh
ls -al /data/df1/
ls -al /data/df2/

-- OUTPUT --
total 20
drwxr-xr-x 2 root root 4096 Sep 27 22:41 .
drwxr-xr-x 8 root root 4096 Sep 27 22:41 ..
-rw-r--r-- 1 root root    8 Sep 27 22:41 ._SUCCESS.crc
-rw-r--r-- 1 root root   12 Sep 27 22:41 ._committed_2366694737653163888.crc
-rw-r--r-- 1 root root    0 Sep 27 22:41 _SUCCESS
-rw-r--r-- 1 root root  112 Sep 27 22:41 _committed_2366694737653163888
total 20
drwxr-xr-x 2 root root 4096 Sep 27 22:41 .
drwxr-xr-x 8 root root 4096 Sep 27 22:41 ..
-rw-r--r-- 1 root root    8 Sep 27 22:41 ._SUCCESS.crc
-rw-r--r-- 1 root root   12 Sep 27 22:41 ._committed_114254853464039644.crc
-rw-r--r-- 1 root root    0 Sep 27 22:41 _SUCCESS
-rw-r--r-- 1 root root  111 Sep 27 22:41 _committed_114254853464039644

%sh
cat /data/train/_committed_2366694737653163888

-- OUTPUT --
{"added":["part-00000-tid-2366694737653163888-4b4ac3f3-9aa3-40f8-8710-cef6b958e3bc-32-1-c000.csv"],"removed":[]}&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;What am I missing here to be able to write these small CSV files?&lt;/P&gt;&lt;P&gt;I would like to read in these 2 CSV files using R.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thank you for any tips / pointers / advice.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
    <pubDate>Mon, 27 Sep 2021 23:16:50 GMT</pubDate>
    <dc:creator>dataslicer</dc:creator>
    <dc:date>2021-09-27T23:16:50Z</dc:date>
    <item>
      <title>Unable to save Spark Dataframe to driver node's local file system as CSV file</title>
      <link>https://community.databricks.com/t5/data-engineering/unable-to-save-spark-dataframe-to-driver-node-s-local-file/m-p/14282#M8802</link>
      <description>&lt;P&gt;Running Azure Databricks Enterprise DBR 8.3 ML on a single node, with a Python notebook.&lt;/P&gt;&lt;P&gt;I have 2 small Spark dataframes that I am able to source via credential passthrough, reading from ADLSgen2 via the `abfss://` method, and I can display their full contents without any issues.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The CSV save operation completed successfully. However, when I examine the CSV output directory, it seems to contain only the commit-marker files, not the actual dataframe data.&lt;/P&gt;&lt;P&gt;Here is the sequence of code cells from start to finish:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;%sh
mkdir /data

type(smallDF1)

-- OUTPUT --
Out[29]: pyspark.sql.dataframe.DataFrame

smallDF1.count()

-- OUTPUT --
Out[27]: 264095

smallDF2.count()

-- OUTPUT --
Out[28]: 66024

smallDF1.coalesce(1).write.csv("file:///data/df1", header='true')
smallDF2.coalesce(1).write.csv("file:///data/df2", header='true')

%sh
ls -al /data/df1/
ls -al /data/df2/

-- OUTPUT --
total 20
drwxr-xr-x 2 root root 4096 Sep 27 22:41 .
drwxr-xr-x 8 root root 4096 Sep 27 22:41 ..
-rw-r--r-- 1 root root    8 Sep 27 22:41 ._SUCCESS.crc
-rw-r--r-- 1 root root   12 Sep 27 22:41 ._committed_2366694737653163888.crc
-rw-r--r-- 1 root root    0 Sep 27 22:41 _SUCCESS
-rw-r--r-- 1 root root  112 Sep 27 22:41 _committed_2366694737653163888
total 20
drwxr-xr-x 2 root root 4096 Sep 27 22:41 .
drwxr-xr-x 8 root root 4096 Sep 27 22:41 ..
-rw-r--r-- 1 root root    8 Sep 27 22:41 ._SUCCESS.crc
-rw-r--r-- 1 root root   12 Sep 27 22:41 ._committed_114254853464039644.crc
-rw-r--r-- 1 root root    0 Sep 27 22:41 _SUCCESS
-rw-r--r-- 1 root root  111 Sep 27 22:41 _committed_114254853464039644

%sh
cat /data/train/_committed_2366694737653163888

-- OUTPUT --
{"added":["part-00000-tid-2366694737653163888-4b4ac3f3-9aa3-40f8-8710-cef6b958e3bc-32-1-c000.csv"],"removed":[]}&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;What am I missing here to be able to write these small CSV files?&lt;/P&gt;&lt;P&gt;I would like to read in these 2 CSV files using R.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thank you for any tips / pointers / advice.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 27 Sep 2021 23:16:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/unable-to-save-spark-dataframe-to-driver-node-s-local-file/m-p/14282#M8802</guid>
      <dc:creator>dataslicer</dc:creator>
      <dc:date>2021-09-27T23:16:50Z</dc:date>
    </item>
    <item>
      <title>Re: Unable to save Spark Dataframe to driver node's local file system as CSV file</title>
      <link>https://community.databricks.com/t5/data-engineering/unable-to-save-spark-dataframe-to-driver-node-s-local-file/m-p/14283#M8803</link>
      <description>&lt;P&gt;Maybe someone else can answer you, but I thought this was a limitation of Spark: it cannot write outside of DBFS.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;For small files, I use:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;df.toPandas().to_csv("/tmp/foo.csv")&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;For large files, write to a DBFS path, and then use the shell to copy /dbfs/foo/partXXXX.csv out of DBFS.&lt;/P&gt;</description>
      <pubDate>Tue, 28 Sep 2021 03:41:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/unable-to-save-spark-dataframe-to-driver-node-s-local-file/m-p/14283#M8803</guid>
      <dc:creator>DouglasLinder</dc:creator>
      <dc:date>2021-09-28T03:41:50Z</dc:date>
    </item>
    <item>
      <title>Re: Unable to save Spark Dataframe to driver node's local file system as CSV file</title>
      <link>https://community.databricks.com/t5/data-engineering/unable-to-save-spark-dataframe-to-driver-node-s-local-file/m-p/14284#M8804</link>
      <description>&lt;P&gt;Thank you for both of these awesome great answers!  &lt;/P&gt;&lt;P&gt;They work!&lt;/P&gt;</description>
      <pubDate>Tue, 28 Sep 2021 06:14:00 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/unable-to-save-spark-dataframe-to-driver-node-s-local-file/m-p/14284#M8804</guid>
      <dc:creator>dataslicer</dc:creator>
      <dc:date>2021-09-28T06:14:00Z</dc:date>
    </item>
    <item>
      <title>Re: Unable to save Spark Dataframe to driver node's local file system as CSV file</title>
      <link>https://community.databricks.com/t5/data-engineering/unable-to-save-spark-dataframe-to-driver-node-s-local-file/m-p/14285#M8805</link>
      <description>&lt;P&gt;There shouldn't be a need to move these outside of DBFS. Ideally you want to write to something like "dbfs:/FileStore/training/df1". Then, if you want to access the files from something that does not understand the DBFS file system, just access them using a straight POSIX path like "/dbfs/FileStore/training/df1/partxxxx.csv".&lt;/P&gt;</description>
      <pubDate>Tue, 28 Sep 2021 13:38:38 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/unable-to-save-spark-dataframe-to-driver-node-s-local-file/m-p/14285#M8805</guid>
      <dc:creator>dazfuller</dc:creator>
      <dc:date>2021-09-28T13:38:38Z</dc:date>
    </item>
    <item>
      <title>Re: Unable to save Spark Dataframe to driver node's local file system as CSV file</title>
      <link>https://community.databricks.com/t5/data-engineering/unable-to-save-spark-dataframe-to-driver-node-s-local-file/m-p/14286#M8806</link>
      <description>&lt;P&gt;Modern Spark is designed to separate storage and compute, so saving a CSV to the driver's local disk doesn't make sense for a few reasons:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;The worker nodes don't have access to the driver's disk. They would need to send the data over to the driver, which is slow, burdensome, and could cause memory/IO issues.&lt;/LI&gt;&lt;LI&gt;Spark is designed to write to Hadoop-style file systems, like DBFS, S3, Azure Blob/Gen2, etc. That way, the workers can write concurrently.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;To do it your way, you could collect the results to the driver, e.g. using yourDF.toPandas(), and then save the pandas dataframe to the driver's local disk. Please note that if you take down the cluster you will lose anything on the local disk; the local disk should only be used as a temporary location, if at all.&lt;/P&gt;</description>
      <pubDate>Tue, 12 Oct 2021 20:41:59 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/unable-to-save-spark-dataframe-to-driver-node-s-local-file/m-p/14286#M8806</guid>
      <dc:creator>Dan_Z</dc:creator>
      <dc:date>2021-10-12T20:41:59Z</dc:date>
    </item>
  </channel>
</rss>

