How to create a single CSV file with a specified file name using Spark in Databricks?
08-15-2024 01:04 AM
I know how to use Spark in Databricks to create a CSV file, but it always produces extra files as side effects.
For example, here is my code:
file_path = "dbfs:/mnt/target_folder/file.csv"
df.write.mode("overwrite").csv(file_path, header=True)
Then what I get is:
- A folder named `file.csv`
- Inside that folder, marker files called `_committed_xxxx`, `_started_xxxx`, and `_SUCCESS`
- Multiple `part-xxxx` files
What I want is a SINGLE CSV file with the name `file.csv`. How can I achieve this?
I tried the pandas `to_csv` function, but it does not work in a Databricks notebook; the error is `OSError: Cannot save file into a non-existent directory`.
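For reference, my pandas attempt looked roughly like this (a sketch; it reproduces the error above):

```python
import pandas as pd

# Convert the Spark DataFrame to pandas and try to write directly.
pdf = df.toPandas()

# Fails: pandas does not understand the dbfs:/ URI and treats it as a
# local relative path on the driver, so the directory does not exist.
pdf.to_csv("dbfs:/mnt/target_folder/file.csv", index=False)
```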
08-15-2024 04:07 AM
Hi @guangyi,
To disable the `_committed_xxxx`, `_started_xxxx`, and `_SUCCESS` files, you must set the Spark options below:
spark.conf.set("spark.databricks.io.directoryCommit.createSuccessFile","false")
spark.conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
spark.conf.set("spark.sql.sources.commitProtocolClass", "org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol")
And if you want a single CSV file, you need to use `coalesce` before the write operation:
df.coalesce(1).write.mode("overwrite").csv(file_path, header=True)
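Note that even with `coalesce(1)`, Spark still writes a folder containing a single `part-xxxx` file. If you need the result at exactly `dbfs:/mnt/target_folder/file.csv`, one common workaround (a sketch, not the only approach; the temporary folder name here is just an example) is to write to a temporary location and then move the part file with `dbutils.fs`:

```python
# Even with coalesce(1), Spark writes a directory, so write to a
# temporary folder first.
tmp_path = "dbfs:/mnt/target_folder/_tmp_file_csv"   # example temp location
final_path = "dbfs:/mnt/target_folder/file.csv"

df.coalesce(1).write.mode("overwrite").csv(tmp_path, header=True)

# Locate the single part file Spark produced inside the temp folder.
part_file = [f.path for f in dbutils.fs.ls(tmp_path) if f.name.startswith("part-")][0]

# Remove any leftover folder or file at the target name from earlier runs,
# then copy the part file to the desired name and clean up the temp folder.
dbutils.fs.rm(final_path, True)  # True = recurse
dbutils.fs.cp(part_file, final_path)
dbutils.fs.rm(tmp_path, True)
```

This keeps everything on DBFS and avoids pulling the data through pandas; it is only practical when the output is small enough to fit in a single partition.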

