Write to a Single CSV File
10-03-2022 09:54 AM
We are reading 520 GB of partitioned CSV files, and when we write them out as a single CSV using repartition(1) it takes 25+ hours. Please let us know an optimized way to create a single CSV file so that our process can complete within 5 hours.
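For context, a minimal sketch of the approach described above, assuming a Databricks notebook where spark is the active session; the paths are placeholders, not from this thread:

df = spark.read.option("header", "true").csv("dbfs:/mnt/source/partitions/")  # placeholder source path
df.repartition(1).write.option("header", "true").csv("dbfs:/mnt/target/single_csv/")  # all 520 GB funnel through one task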
- Labels: Single CSV File, Source Data Size, Spark
10-03-2022 01:01 PM
If you use repartition(1), only one core of your whole cluster does the work. Please repartition to the number of cores instead (spark.sparkContext.defaultParallelism).
After writing you will get one file per core, so use other software to merge the files if you want just one (ADF has some excellent options for that in its Copy activity).
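A minimal sketch of that suggestion, again assuming a Databricks notebook where spark is the active session and df is already loaded; the output path is a placeholder:

num_partitions = spark.sparkContext.defaultParallelism  # one partition per available core
(df.repartition(num_partitions)
   .write
   .option("header", "true")
   .mode("overwrite")
   .csv("dbfs:/mnt/target/output_parts"))  # writes one part file per partition, in parallel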
10-05-2022 10:50 PM
Thank you for your time and support. Is there any other effective method to combine the part CSV files into a single CSV file in Databricks?
10-14-2022 04:32 AM
The method in Databricks is the one you are already using, repartition(1), and it is slow.
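As a hedged sketch of the "merge with other software" idea from the earlier reply, the part files can also be concatenated without Spark, assuming the parts were written with headers and the cluster exposes the /dbfs FUSE mount; all paths here are placeholders:

import glob, shutil

part_files = sorted(glob.glob("/dbfs/mnt/target/output_parts/part-*.csv"))  # placeholder path
with open("/dbfs/mnt/target/merged.csv", "wb") as out:
    with open(part_files[0], "rb") as first:
        shutil.copyfileobj(first, out)  # keep the header row from the first part
    for path in part_files[1:]:
        with open(path, "rb") as f:
            f.readline()  # skip the duplicate header row in every later part
            shutil.copyfileobj(f, out)

Note that this copy runs sequentially on the driver, but it avoids both the shuffle and the single-task Spark write.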
10-28-2022 03:57 PM
You can use coalesce(1), for example:
df.coalesce(1).write.option("header", "true").csv("path_to_save_your_CSV")  # still a single writer task
Note that coalesce(1) avoids the full shuffle that repartition(1) performs, but the write still happens in one task, so it can hit a similar bottleneck on large data.

11-11-2022 10:51 PM
Hi @mohit kumar suthar
Hope all is well! Just wanted to check in to see whether you were able to resolve your issue. If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.
We'd love to hear from you.
Thanks!

