Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to create a single CSV file with a specified file name using Spark in Databricks?

guangyi
Contributor III

I know how to use Spark in Databricks to create a CSV, but it always has lots of side effects.

For example, here is my code:

file_path = "dbfs:/mnt/target_folder/file.csv"

df.write.mode("overwrite").csv(file_path, header=True)

Then what I get is:

  • A folder named `file.csv`
  • Inside that folder, marker files named `_committed_xxxx`, `_started_xxxx`, and `_SUCCESS`
  • Multiple data files named `part-xxxx`

What I want is a SINGLE CSV file named `file.csv`. How can I achieve this?

I tried the `pandas.to_csv` function, but it doesn't work in a Databricks notebook; the error is "OSError: Cannot save file into a non-existent directory".

1 REPLY

szymon_dybczak
Esteemed Contributor III

Hi @guangyi ,

To disable the `_committed_xxxx`, `_started_xxxx`, and `_SUCCESS` files, you must set the Spark options below:

# Skip the _SUCCESS marker file
spark.conf.set("spark.databricks.io.directoryCommit.createSuccessFile", "false")
spark.conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
# Use the open-source commit protocol so the _started_*/_committed_* markers are not written
spark.conf.set("spark.sql.sources.commitProtocolClass", "org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol")
And if you want a single CSV file, you need to call `coalesce(1)` before the write operation:

df.coalesce(1).write.mode("overwrite").csv(file_path, header=True)
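Note that even with `coalesce(1)`, Spark still writes a directory containing a single `part-xxxx` file rather than a file literally named `file.csv`. A common follow-up pattern is to write to a temporary folder and then move the lone part file to the desired path; on Databricks you would typically do the move with `dbutils.fs.ls` and `dbutils.fs.cp`, while the local-filesystem sketch below illustrates the same steps with the standard library (the helper name `promote_single_csv` is illustrative, not a Databricks API):

```python
import glob
import os
import shutil

def promote_single_csv(spark_output_dir: str, target_path: str) -> None:
    """Move the single part file produced by coalesce(1) to target_path,
    then delete the Spark output directory and its marker files."""
    part_files = glob.glob(os.path.join(spark_output_dir, "part-*"))
    if len(part_files) != 1:
        raise RuntimeError(f"expected exactly one part file, found {len(part_files)}")
    shutil.move(part_files[0], target_path)
    # Removing the directory also removes _SUCCESS and any remaining markers
    shutil.rmtree(spark_output_dir)
```

On Databricks, the `/dbfs` fuse mount lets this local-filesystem approach work against DBFS paths (e.g. `/dbfs/mnt/target_folder/file.csv`), assuming the mount is available on the driver.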



 
