I'm running Azure Databricks (Enterprise) DBR 8.3 ML on a single-node cluster, with a Python notebook.
I have 2 small Spark dataframes that I am able to source via credential passthrough, reading from ADLS Gen2 via the `abfss://` method, and I can display their full contents without any issues.
The CSV save operation completes successfully. However, when I examine the CSV output directory, it seems to contain only the commit/metadata files, not the actual dataframe data.
Here is the sequence of code cells from start to finish:
%sh
mkdir /data
type(smallDF1)
-- OUTPUT --
Out[29]: pyspark.sql.dataframe.DataFrame
smallDF1.count()
-- OUTPUT --
Out[27]: 264095
smallDF2.count()
-- OUTPUT --
Out[28]: 66024
smallDF1.coalesce(1).write.csv("file:///data/df1", header = 'true')
smallDF2.coalesce(1).write.csv("file:///data/df2", header = 'true')
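As a sanity check that the driver's local filesystem is writable at all, a plain-Python write (no Spark involved; the sample rows and the temp directory are made up for illustration) does produce a visible, non-empty file:

```python
import csv
import os
import tempfile

# Made-up sample rows standing in for a few records of smallDF1.
rows = [
    {"id": 1, "value": "alpha"},
    {"id": 2, "value": "beta"},
]

out_dir = tempfile.mkdtemp()  # stands in for /data/df1
out_path = os.path.join(out_dir, "sample.csv")

with open(out_path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "value"])
    writer.writeheader()  # header row, like header='true' in the Spark write
    writer.writerows(rows)

print(os.path.exists(out_path), os.path.getsize(out_path) > 0)
```

So the problem seems specific to how Spark's `file:///` write lands on the node, not to local disk permissions.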
%sh
ls -al /data/df1/
ls -al /data/df2/
-- OUTPUT --
total 20
drwxr-xr-x 2 root root 4096 Sep 27 22:41 .
drwxr-xr-x 8 root root 4096 Sep 27 22:41 ..
-rw-r--r-- 1 root root 8 Sep 27 22:41 ._SUCCESS.crc
-rw-r--r-- 1 root root 12 Sep 27 22:41 ._committed_2366694737653163888.crc
-rw-r--r-- 1 root root 0 Sep 27 22:41 _SUCCESS
-rw-r--r-- 1 root root 112 Sep 27 22:41 _committed_2366694737653163888
total 20
drwxr-xr-x 2 root root 4096 Sep 27 22:41 .
drwxr-xr-x 8 root root 4096 Sep 27 22:41 ..
-rw-r--r-- 1 root root 8 Sep 27 22:41 ._SUCCESS.crc
-rw-r--r-- 1 root root 12 Sep 27 22:41 ._committed_114254853464039644.crc
-rw-r--r-- 1 root root 0 Sep 27 22:41 _SUCCESS
-rw-r--r-- 1 root root 111 Sep 27 22:41 _committed_114254853464039644
%sh
cat /data/df1/_committed_2366694737653163888
-- OUTPUT --
{"added":["part-00000-tid-2366694737653163888-4b4ac3f3-9aa3-40f8-8710-cef6b958e3bc-32-1-c000.csv"],"removed":[]}
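That `_committed_*` file is just JSON metadata from the commit protocol; parsing the literal output above (sketch, using the string exactly as printed) shows the part file Spark claims to have added, even though no such file is present in the directory listing:

```python
import json

# The literal contents of the _committed_* file from the cell above.
committed = (
    '{"added":["part-00000-tid-2366694737653163888-'
    '4b4ac3f3-9aa3-40f8-8710-cef6b958e3bc-32-1-c000.csv"],"removed":[]}'
)

meta = json.loads(committed)
print(meta["added"])    # the part file Spark believes it wrote
print(meta["removed"])  # nothing was removed
```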
What am I missing here to be able to write these small CSV files?
I would like to read in these 2 CSV files using R.
Thank you for any tips / pointers / advice.
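For reference, once the part file actually lands, my plan was to locate the single CSV with a glob before handing its path to R. Here is that lookup as a self-contained sketch (the temp directory and simulated part file stand in for what a successful `coalesce(1)` write should leave in `/data/df1`):

```python
import glob
import os
import tempfile

out_dir = tempfile.mkdtemp()  # stands in for /data/df1

# Simulate the output a successful coalesce(1) write should leave behind.
part_name = "part-00000-tid-0000-c000.csv"  # hypothetical part file name
with open(os.path.join(out_dir, part_name), "w") as f:
    f.write("id,value\n1,alpha\n")
open(os.path.join(out_dir, "_SUCCESS"), "w").close()

# The one data file to pass to R's read.csv(); metadata files are ignored.
matches = glob.glob(os.path.join(out_dir, "part-*.csv"))
print(matches)
```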