How to create a single CSV file with specified file name Spark in Databricks?

guangyi — Thu, 15 Aug 2024 08:04:26 GMT

I know how to use Spark in Databricks to create a CSV, but it always has lots of side effects.

For example, here is my code:

file_path = "dbfs:/mnt/target_folder/file.csv"

df.write.mode("overwrite").csv(file_path, header=True)

Then what I got is

A folder named `file.csv`
In the folder there are files called `_committed_xxxx`, `_started_xxxx`, and `_SUCCESS`
Multiple files named `part-xxxx`

What I want is only a SINGLE CSV file named `file.csv`. How can I achieve this?

I tried the pandas `to_csv` function, but it does not work in a Databricks notebook. The error is “OSError: Cannot save file into a non-existent directory”.

Re: How to create a single CSV file with specified file name Spark in Databricks?

szymon_dybczak — Thu, 15 Aug 2024 11:07:24 GMT

Hi @guangyi ,

To disable the `_committed_xxx`, `_started_xxx` and `_SUCCESS` files, set the following Spark options:

spark.conf.set("spark.databricks.io.directoryCommit.createSuccessFile", "false")
spark.conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
spark.conf.set("spark.sql.sources.commitProtocolClass", "org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol")

And if you want a single CSV file, use coalesce(1) before the write operation:

df.coalesce(1).write.mode("overwrite").csv(file_path, header=True)
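
Note that coalesce(1) still produces a folder containing one `part-xxxx` file, not a file literally named `file.csv`. One possible way to get the exact name is to write to a temporary folder and then move the single part file to the target path. Below is a minimal sketch of that step, assuming the folder is reachable through a local path (for example under the `/dbfs/...` mount). The helper `promote_single_part_file` and the temporary folder name are hypothetical, not Databricks APIs. Where local paths are not available, `dbutils.fs.ls` and `dbutils.fs.mv` could probably do the same steps.

```python
import os
import shutil


def promote_single_part_file(output_dir: str, target_file: str) -> str:
    """Move the only part-* file in output_dir to target_file, then remove output_dir.

    Hypothetical helper: output_dir is the folder Spark wrote, target_file is the
    final CSV path. The two paths must not be the same.
    """
    # Keep only the data file; markers such as _SUCCESS or _committed_xxx are skipped.
    part_files = [name for name in os.listdir(output_dir) if name.startswith("part-")]
    if len(part_files) != 1:
        raise ValueError(f"expected exactly one part file, found {len(part_files)}")

    # Create the target folder if it does not exist yet.
    target_dir = os.path.dirname(target_file)
    if target_dir:
        os.makedirs(target_dir, exist_ok=True)

    shutil.move(os.path.join(output_dir, part_files[0]), target_file)
    shutil.rmtree(output_dir)  # drop the temporary folder and any marker files
    return target_file


# Possible usage in a notebook (paths are assumptions based on the question):
# tmp_dir = "dbfs:/mnt/target_folder/file_csv_tmp"
# df.coalesce(1).write.mode("overwrite").csv(tmp_dir, header=True)
# promote_single_part_file("/dbfs/mnt/target_folder/file_csv_tmp",
#                          "/dbfs/mnt/target_folder/file.csv")
```

With this approach the Spark options above become optional, since the marker files are deleted together with the temporary folder.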
