Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to create a single CSV file with a specified file name using Spark in Databricks?

guangyi
Contributor

I know how to use Spark in Databricks to create a CSV, but it always has lots of side effects.

For example, here is my code:

file_path = "dbfs:/mnt/target_folder/file.csv"

df.write.mode("overwrite").csv(file_path, header=True)

What I get is:

  • A folder named `file.csv`
  • Inside the folder, files called `_committed_xxxx`, `_started_xxxx` and `_SUCCESS`
  • Multiple files named `part-xxxx`

What I want is a SINGLE CSV file named `file.csv`. How can I achieve this?

I tried the pandas `to_csv` function, but it does not work in a Databricks notebook. The error is "OSError: Cannot save file into a non-existent directory".

1 REPLY

Slash
Contributor

Hi @guangyi ,

To stop Spark from writing the `_committed_xxx`, `_started_xxx` and `_SUCCESS` files, set the following Spark options before the write:

 

 

spark.conf.set("spark.databricks.io.directoryCommit.createSuccessFile","false") 
spark.conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
spark.conf.set("spark.sql.sources.commitProtocolClass", "org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol")

 

 

If you want a single CSV file, call coalesce(1) before the write operation. Spark still creates a folder at the target path, but it will contain only one part file:

df.coalesce(1).write.mode("overwrite").csv(file_path, header=True)
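
To end up with a file that is literally named `file.csv`, one common approach is to write to a temporary folder and then move the single part file to the final name. Below is a minimal sketch of that step, not a definitive implementation. It assumes DBFS is reachable on the driver through the local `/dbfs` mount; the helper name `promote_single_part_file` and the temporary folder name `_tmp_file_csv` are hypothetical and only for illustration.

```python
import glob
import os
import shutil


def promote_single_part_file(tmp_dir: str, target_path: str) -> str:
    """Move the only part-*.csv file in tmp_dir to target_path, then delete tmp_dir.

    Both arguments are local filesystem paths (for example under /dbfs/...),
    not dbfs:/ URIs.
    """
    part_files = glob.glob(os.path.join(tmp_dir, "part-*.csv"))
    if len(part_files) != 1:
        raise ValueError(
            f"expected exactly one part file in {tmp_dir}, found {len(part_files)}"
        )
    if os.path.isdir(target_path):
        # An earlier Spark write may have left a folder with this name.
        raise IsADirectoryError(f"{target_path} is a directory, remove it first")

    # Create the destination folder if needed, then move the part file.
    os.makedirs(os.path.dirname(target_path) or ".", exist_ok=True)
    shutil.move(part_files[0], target_path)

    # Remove the temporary folder together with any marker files left in it.
    shutil.rmtree(tmp_dir)
    return target_path


# Hypothetical notebook usage (assumes `df` exists and /dbfs is mounted):
# df.coalesce(1).write.mode("overwrite").csv(
#     "dbfs:/mnt/target_folder/_tmp_file_csv", header=True)
# promote_single_part_file(
#     "/dbfs/mnt/target_folder/_tmp_file_csv",
#     "/dbfs/mnt/target_folder/file.csv")
```

If the `/dbfs` mount is not available on your cluster, the same steps can be done with `dbutils.fs.ls`, `dbutils.fs.mv` and `dbutils.fs.rm` on the `dbfs:/` paths.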


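On the pandas error from the question: pandas most likely treats `dbfs:/mnt/target_folder/file.csv` as a plain local path, and no such local directory exists. A rough sketch of the usual workaround follows, assuming the cluster exposes DBFS through the `/dbfs` mount (this is not the case on every cluster type). The helper name `to_local_dbfs_path` is hypothetical.

```python
def to_local_dbfs_path(dbfs_uri: str) -> str:
    """Map a 'dbfs:/...' URI to the matching path under the assumed /dbfs mount."""
    prefix = "dbfs:/"
    if not dbfs_uri.startswith(prefix):
        raise ValueError(f"not a dbfs:/ URI: {dbfs_uri}")
    return "/dbfs/" + dbfs_uri[len(prefix):].lstrip("/")


print(to_local_dbfs_path("dbfs:/mnt/target_folder/file.csv"))
# -> /dbfs/mnt/target_folder/file.csv

# Hypothetical notebook usage (assumes `df` is small enough to fit on the driver):
# local_path = to_local_dbfs_path("dbfs:/mnt/target_folder/file.csv")
# df.toPandas().to_csv(local_path, index=False)
```

Since `toPandas()` collects all rows onto the driver, this only suits data that fits in driver memory.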

 
