02-14-2022 09:48 PM
import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder()
  .master("local[3]")
  .appName("SparkByExamples.com")
  .getOrCreate()

// Spark read CSV file with a header row
val df = spark.read.option("header", true).csv("address.csv")

// Write DataFrame to the address directory
df.write.csv("address")
The write statement above writes 3 CSV files plus .crc and _SUCCESS files.
Is there any option in Spark not to write these files? I found an article that explains how to remove these files after writing (https://sparkbyexamples.com/spark/spark-write-dataframe-single-csv-file/), but I can't use that approach for several reasons.
I hope the question is clear, and I'm looking forward to an answer here.
Much appreciated.
Accepted Solutions
02-14-2022 11:06 PM
spark.conf.set("spark.sql.sources.commitProtocolClass", "org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol")
spark.conf.set("parquet.enable.summary-metadata", "false")
spark.conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
These parameters prevent Spark from writing the metadata files.
The fact that you have multiple CSV files is the result of parallel processing. If you do not want that, you will have to add coalesce(1) to your write statement, but that will impact the performance of your Spark code.
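Putting that together, a minimal sketch of the write (assuming the same spark session, df, and "address" output path as in the question; the header option is carried over from the read):

// Suppress the _SUCCESS marker and summary/checksum metadata files
spark.conf.set("spark.sql.sources.commitProtocolClass", "org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol")
spark.conf.set("parquet.enable.summary-metadata", "false")
spark.conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

// coalesce(1) collapses the DataFrame to a single partition, so Spark
// writes one part file instead of one per partition (at the cost of parallelism)
df.coalesce(1)
  .write
  .option("header", true)
  .csv("address")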
01-08-2024 06:09 PM
Will your CSV have the name prefix 'part-', or can you name it whatever you like?
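For what it's worth, Spark's DataFrameWriter always generates part-prefixed file names; a common workaround (the one described in the article linked in the question) is to rename the part file after writing via the Hadoop FileSystem API. A minimal sketch, assuming the output was written with coalesce(1) so exactly one part file exists in the address directory:

import org.apache.hadoop.fs.{FileSystem, Path}

// Rename the single part-*.csv file to a friendlier name after writing
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val partFile = fs.globStatus(new Path("address/part-*.csv"))(0).getPath
fs.rename(partFile, new Path("address/address.csv"))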

