Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Write in Single CSV file

Mohit_Kumar_Sut
New Contributor III

We are reading 520 GB of partitioned CSV files, and when we write them out as a single CSV using repartition(1) it takes 25+ hours. Please let us know an optimized way to create a single CSV file so that our process can complete within 5 hours.

5 REPLIES

Hubert-Dudek
Esteemed Contributor III

If you use repartition(1), only one core of your whole cluster does the work. Please repartition to the number of cores instead (sc.defaultParallelism).

After writing, you will get one file per core, so use other software to merge the files if you want only one (ADF has some excellent options for that in its Copy activity).

Thank you for your time and support. Is there any other effective method to combine the part CSV files into a single CSV file in Databricks?

Hubert-Dudek
Esteemed Contributor III

The built-in method in Databricks is the one you are using, and it is slow (repartition(1)).

You can use coalesce(1) instead, which avoids a full shuffle, for example:

df.coalesce(1).write.option("header","true").csv("path_to_save_your_CSV")
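As an alternative to coalesce(1), you can write in parallel and then concatenate the part files on the driver. Below is a minimal sketch: merge_part_csvs is a hypothetical helper (not a Databricks or Spark API), and it assumes every part file was written with option("header", "true") so each begins with the same header row. For DBFS paths you would first copy the parts to local storage (e.g. via /dbfs mounts).

```python
import glob
import os
import shutil

def merge_part_csvs(parts_dir, out_path):
    """Concatenate Spark part-*.csv files into one CSV, keeping a single header.

    Hypothetical helper for illustration: assumes all part files share the
    same header line (written with option("header", "true")).
    """
    parts = sorted(glob.glob(os.path.join(parts_dir, "part-*.csv")))
    with open(out_path, "w", newline="") as out:
        for i, part in enumerate(parts):
            with open(part) as f:
                header = f.readline()
                if i == 0:
                    out.write(header)  # keep the header from the first part only
                shutil.copyfileobj(f, out)  # stream the remaining rows
    return out_path
```

Because this streams bytes with copyfileobj rather than parsing rows, the merge step is I/O-bound and far cheaper than forcing Spark to shuffle everything into one task.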

Anonymous
Not applicable

Hi @mohit kumar suthar,

Hope all is well! Just wanted to check in to see whether you were able to resolve your issue. If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!
