
Write to a single CSV file

Mohit_Kumar_Sut
New Contributor III

We are reading 520 GB of partitioned CSV files, and writing them out as a single CSV using repartition(1) takes 25+ hours. Please let us know an optimized way to create a single CSV file so that our process can complete within 5 hours.

5 REPLIES

Hubert-Dudek
Esteemed Contributor III

If you repartition(1), only one core of your whole cluster does the write. Instead, repartition to the number of cores, e.g. df.repartition(sc.defaultParallelism).

After writing, you will get one file per partition, so use other software to merge the files if you want only one (Azure Data Factory has excellent options for that in its Copy activity).

Thank you for your time and support. Is there any other effective method to combine the part CSV files into a single CSV file in Databricks?

Hubert-Dudek
Esteemed Contributor III

The method available in Databricks is the one you are already using, and it is slow (repartition(1)). You can try coalesce(1) instead, which avoids the full shuffle that repartition(1) performs, for example:

df.coalesce(1).write.option("header","true").csv("path_to_save_your_CSV")
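As a faster alternative to funnelling the whole write through one core, you can let Spark write its part files in parallel and then concatenate them on the driver. Below is a minimal sketch in plain Python (not a Databricks API); merge_csv_parts and the paths are hypothetical names, and it assumes each part file was written with .option("header", "true"), i.e. every part repeats the same header row.

```python
import glob
import shutil

def merge_csv_parts(parts_dir: str, out_path: str) -> None:
    """Concatenate Spark CSV part files into one file, keeping one header.

    Assumes every part file starts with the same header row (the default
    when the DataFrame was written with .option("header", "true")).
    """
    # Spark names its output files part-00000-..., part-00001-..., etc.
    parts = sorted(glob.glob(f"{parts_dir}/part-*.csv"))
    with open(out_path, "w", newline="") as out:
        for i, part in enumerate(parts):
            with open(part) as f:
                header = f.readline()
                if i == 0:
                    out.write(header)        # keep the header only once
                shutil.copyfileobj(f, out)   # stream the remaining rows
```

On Databricks the parts_dir would typically be a /dbfs/... path. This merge is single-threaded, but a sequential byte-level concatenation of already-written files is usually far cheaper than forcing Spark to compute and write everything through a single task.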

Anonymous
Not applicable

Hi @mohit kumar suthar

Hope all is well! Just wanted to check in: were you able to resolve your issue? If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!
