Write to a Single CSV File
10-03-2022 09:54 AM
We are reading 520 GB of partitioned CSV files, and when we write them out as a single CSV using repartition(1) it takes 25+ hours. Please let us know an optimized way to create a single CSV file so that our process can complete within 5 hours.
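For context, a minimal sketch of the approach described above, assuming a Databricks notebook where spark is the active session; the paths are placeholders, not from this thread:

df = spark.read.option("header", "true").csv("dbfs:/mnt/source/partitions/")  # placeholder source path
df.repartition(1).write.option("header", "true").csv("dbfs:/mnt/target/single_csv/")  # all 520 GB funnel through one task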
- Labels: Single CSV File, Source Data Size, Spark
10-03-2022 01:01 PM
If you use repartition(1), only one core of your whole cluster does the work. Please repartition to the number of cores instead (spark.sparkContext.defaultParallelism).
After writing you will get one file per core, so use other software to merge the files if you want just one (ADF has some excellent options for that in its Copy activity).
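A minimal sketch of that suggestion, again assuming a Databricks notebook where spark is the active session and df is already loaded; the output path is a placeholder:

num_partitions = spark.sparkContext.defaultParallelism  # one partition per available core
(df.repartition(num_partitions)
   .write
   .option("header", "true")
   .mode("overwrite")
   .csv("dbfs:/mnt/target/output_parts"))  # writes one part file per partition, in parallel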
10-05-2022 10:50 PM
Thank you for your time and support. Is there any other effective method to combine the part CSV files into a single CSV file in Databricks?
10-14-2022 04:32 AM
The method in Databricks is the one you are already using, repartition(1), and it is slow.
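As a hedged sketch of the "merge with other software" idea from the earlier reply, the part files can also be concatenated without Spark, assuming the parts were written with headers and the cluster exposes the /dbfs FUSE mount; all paths here are placeholders:

import glob, shutil

part_files = sorted(glob.glob("/dbfs/mnt/target/output_parts/part-*.csv"))  # placeholder path
with open("/dbfs/mnt/target/merged.csv", "wb") as out:
    with open(part_files[0], "rb") as first:
        shutil.copyfileobj(first, out)  # keep the header row from the first part
    for path in part_files[1:]:
        with open(path, "rb") as f:
            f.readline()  # skip the duplicate header row in every later part
            shutil.copyfileobj(f, out)

Note that this copy runs sequentially on the driver, but it avoids both the shuffle and the single-task Spark write.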
10-28-2022 03:57 PM
You can use coalesce(1), for example:
df.coalesce(1).write.option("header", "true").csv("path_to_save_your_CSV")  # still a single writer task
Note that coalesce(1) avoids the full shuffle that repartition(1) performs, but the write still happens in one task, so it can hit a similar bottleneck on large data.

11-11-2022 10:51 PM
Hi @mohit kumar suthar
Hope all is well! Just wanted to check in to see whether you were able to resolve your issue. If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.
We'd love to hear from you.
Thanks!

