Data Engineering
Writing a Spark DataFrame as CSV to a repo

chari
Contributor

Hi,

I wrote a Spark DataFrame as CSV to a repo (synced with GitHub), but when I checked the folder, the file wasn't there.

Here is my code:

spark_df.write.format('csv').option('header','true').mode('overwrite').save('/Repos/abcd/mno/data')
 
There was no error message, yet there is no file inside the 'data' folder. For context, 'abcd' is my repo name; inside it is a folder 'mno', which contains 'data'.
-Thx
3 REPLIES

Kaniz
Community Manager

Hi @chari

 

  1. When you use the save method with the CSV format, Spark writes a folder containing multiple CSV part files, one per partition of your DataFrame. That is why you see a folder named "data" instead of a single CSV file. If your end goal is a single CSV file rather than a folder, there are a couple of approaches you can try:
     - Repartition (or coalesce) your DataFrame down to a single partition before writing. Spark will then produce one part file inside the output folder.
     - Alternatively, collect the data to the driver and write one CSV file directly. Keep in mind that this will overwrite any previous file with the same name, and because it brings all the data to the driver node, it is only suitable for DataFrames that fit in memory.
  2. If your DataFrame is small enough to fit into memory, you can also convert it to a pandas DataFrame and save it as a CSV file. Avoid this method for large datasets, as it can overload the driver node's memory.

Be sure to choose the best approach for your specific scenario. As always, don't hesitate to reach out if you have any questions or require further assistance.
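For the pandas approach, here is a minimal sketch. It builds a small pandas DataFrame directly for illustration; on Databricks you would instead call `spark_df.toPandas()` first (only safe for small DataFrames), and the output path shown is hypothetical:

```python
import pandas as pd

# On Databricks you would first bring the Spark DataFrame to the driver:
#   pdf = spark_df.toPandas()   # only for DataFrames that fit in driver memory
# Here we construct a small pandas DataFrame directly for illustration.
pdf = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# to_csv writes a single CSV file, not a folder of part files.
pdf.to_csv("data.csv", index=False)

# Read it back to confirm the round trip.
check = pd.read_csv("data.csv")
print(len(check))  # 3
```

Note that `df.coalesce(1).write.csv(...)` in Spark still produces a folder containing one part file with a generated name, whereas `to_csv` gives you a single file at exactly the path you specify.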

Hi @Kaniz ,

Thanks for the help, but my problem is that I do not see any files at all in the target folder. I am still looking for options.

-Regards

feiyun0112
Contributor III

The folder 'Repos' in your save path is not your repo; Spark resolved it as a DBFS path, i.e. `dbfs:/Repos`. Please check whether your files landed there:

dbutils.fs.ls('/Repos/abcd/mno/data')

 
