How to create a single CSV file with a specified file name using Spark in Databricks?
08-15-2024 01:04 AM
I know how to use Spark in Databricks to create a CSV file, but it always produces extra files as side effects.
For example, here is my code:
file_path = "dbfs:/mnt/target_folder/file.csv"
df.write.mode("overwrite").csv(file_path, header=True)
Then what I get is:
- A folder named `file.csv`
- Inside that folder, marker files called `_committed_xxxx`, `_started_xxxx`, and `_SUCCESS`
- Multiple `part-xxxx` files
What I want is a SINGLE CSV file with the name `file.csv`. How can I achieve this?
I tried the pandas `to_csv` function, but it does not work in a Databricks notebook; the error is `OSError: Cannot save file into a non-existent directory`.
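For reference, my pandas attempt looked roughly like this (a sketch; it reproduces the error above):

```python
import pandas as pd

# Convert the Spark DataFrame to pandas and try to write directly.
pdf = df.toPandas()

# Fails: pandas does not understand the dbfs:/ URI and treats it as a
# local relative path on the driver, so the directory does not exist.
pdf.to_csv("dbfs:/mnt/target_folder/file.csv", index=False)
```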
08-15-2024 04:07 AM
Hi @guangyi,
To disable the `_committed_xxxx`, `_started_xxxx`, and `_SUCCESS` files, you must set the Spark options below:
spark.conf.set("spark.databricks.io.directoryCommit.createSuccessFile","false")
spark.conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
spark.conf.set("spark.sql.sources.commitProtocolClass", "org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol")
And if you want a single CSV file, you need to use `coalesce` before the write operation:
df.coalesce(1).write.mode("overwrite").csv(file_path, header=True)
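Note that even with `coalesce(1)`, Spark still writes a folder containing a single `part-xxxx` file. If you need the result at exactly `dbfs:/mnt/target_folder/file.csv`, one common workaround (a sketch, not the only approach; the temporary folder name here is just an example) is to write to a temporary location and then move the part file with `dbutils.fs`:

```python
# Even with coalesce(1), Spark writes a directory, so write to a
# temporary folder first.
tmp_path = "dbfs:/mnt/target_folder/_tmp_file_csv"   # example temp location
final_path = "dbfs:/mnt/target_folder/file.csv"

df.coalesce(1).write.mode("overwrite").csv(tmp_path, header=True)

# Locate the single part file Spark produced inside the temp folder.
part_file = [f.path for f in dbutils.fs.ls(tmp_path) if f.name.startswith("part-")][0]

# Remove any leftover folder or file at the target name from earlier runs,
# then copy the part file to the desired name and clean up the temp folder.
dbutils.fs.rm(final_path, True)  # True = recurse
dbutils.fs.cp(part_file, final_path)
dbutils.fs.rm(tmp_path, True)
```

This keeps everything on DBFS and avoids pulling the data through pandas; it is only practical when the output is small enough to fit in a single partition.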

