Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to write a Spark DataFrame to a CSV file without .CRC files in Azure Databricks?

prapot
New Contributor II

import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder()
  .master("local[3]")
  .appName("SparkByExamples.com")
  .getOrCreate()

// Read the CSV file into a DataFrame
val df = spark.read.option("header", true).csv("address.csv")

// Write the DataFrame to the address directory
df.write.csv("address")

The write statement above produces 3 CSV part files plus .CRC and _SUCCESS files.

Is there any option in Spark not to write these files? I found an article that explains how to remove them after writing (https://sparkbyexamples.com/spark/spark-write-dataframe-single-csv-file/), but I can't use that approach for several reasons.

Hope the question is clear, and looking forward to an answer here.

Appreciate it.

1 ACCEPTED SOLUTION

Accepted Solutions

-werners-
Esteemed Contributor III

spark.conf.set("spark.sql.sources.commitProtocolClass", "org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol")

spark.conf.set("parquet.enable.summary-metadata", "false")

spark.conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

These parameters avoid writing the metadata files.

The fact that you have multiple CSV files is the result of parallel processing. If you do not want that, you will have to add coalesce(1) to your write statement.

But that will impact the performance of your Spark code.
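Putting that together, here is a minimal sketch of the full write (assuming the SparkSession and DataFrame from the question; the output path "address" and the overwrite mode are just for illustration):

// Disable the extra metadata files
spark.conf.set("spark.sql.sources.commitProtocolClass", "org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol")
spark.conf.set("parquet.enable.summary-metadata", "false")
spark.conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

// Optional: coalesce(1) forces a single CSV part file,
// at the cost of funneling all data through one task
df.coalesce(1)
  .write
  .option("header", true)
  .mode("overwrite")
  .csv("address")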


2 REPLIES


Nw2this
New Contributor II

Will your CSV have the name prefix 'part-', or can you name it whatever you like?
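For what it's worth, Spark itself always names its output files with the part- prefix; the CSV writer has no option to pick the file name. A common workaround, sketched here under the assumption of a Databricks notebook (so dbutils is available), a coalesce(1) write, and purely illustrative paths, is to rename the single part file after the write:

// Find the single part file Spark wrote and move it to a name of our choosing
// (both paths below are hypothetical)
val outputDir = "dbfs:/tmp/address"
val partFile = dbutils.fs.ls(outputDir)
  .map(_.path)
  .filter(_.contains("part-"))
  .head

dbutils.fs.mv(partFile, "dbfs:/tmp/address.csv")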
