Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to create a single CSV file with specified file name Spark in Databricks?

guangyi
Contributor III

I know how to use Spark in Databricks to create a CSV file, but doing so always produces side effects.

For example, here is my code:

file_path = "dbfs:/mnt/target_folder/file.csv"

df.write.mode("overwrite").csv(file_path, header=True)

Then what I got is

  • A folder named `file.csv`
  • Inside the folder, files called `_committed_xxxx`, `_started_xxxx`, and `_SUCCESS`
  • Multiple files named `part-xxxx`

What I want is a SINGLE CSV file named `file.csv`. How can I achieve this?

I tried to use the pandas `to_csv` function, but it's not working in a Databricks notebook; the error is "OSError: Cannot save file into a non-existent directory".
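(For reference, that pandas error usually means the target directory does not exist: unlike Spark, `to_csv` will not create missing parent directories, and on Databricks local-filesystem APIs like pandas must address DBFS through the `/dbfs` fuse mount rather than a `dbfs:/` URI. A minimal sketch, assuming the directory simply needs to be created first; the helper name and paths are placeholders:)

```python
import os

import pandas as pd


def save_csv(df: pd.DataFrame, path: str) -> None:
    """Write a DataFrame to a single CSV file, creating parent dirs.

    On Databricks, `path` should be a fuse-mount path such as
    /dbfs/mnt/target_folder/file.csv, not a dbfs:/ URI.
    """
    # pandas does not create missing parent directories, so make them first
    os.makedirs(os.path.dirname(path), exist_ok=True)
    df.to_csv(path, index=False)
```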

1 REPLY 1

szymon_dybczak
Contributor III

Hi @guangyi ,

To disable `_committed_xxx`, `_started_xxx` and `_SUCCESS` you must set the Spark options below:

# Don't create the _SUCCESS marker file
spark.conf.set("spark.databricks.io.directoryCommit.createSuccessFile", "false")
spark.conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
# Use the standard Hadoop commit protocol, which skips _started/_committed files
spark.conf.set("spark.sql.sources.commitProtocolClass", "org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol")

And if you want a single CSV file, you need to use coalesce before the write operation:

df.coalesce(1).write.mode("overwrite").csv(file_path, header=True)
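Note that even with `coalesce(1)`, Spark still writes a directory named `file.csv` containing one `part-xxxx` file. To end up with a plain file at the desired name, one common workaround is to move and rename that part file after the write. A minimal sketch in plain Python (on Databricks you would pass the `/dbfs` fuse-mount form of the path, e.g. `/dbfs/mnt/target_folder/...`; the helper name and paths here are placeholders):

```python
import glob
import os
import shutil


def promote_single_csv(spark_output_dir: str, final_path: str) -> None:
    """Move the single part file out of a Spark output directory,
    rename it to final_path, and remove the now-empty directory."""
    part_files = glob.glob(os.path.join(spark_output_dir, "part-*"))
    if len(part_files) != 1:
        raise ValueError(f"expected one part file, found {len(part_files)}")
    shutil.move(part_files[0], final_path)
    shutil.rmtree(spark_output_dir)
```

On Databricks the same move could also be done with `dbutils.fs.mv` against `dbfs:/` URIs; the plain-Python version works through the fuse mount.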
