Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to write a Spark DataFrame to a CSV file without .CRC files in Azure Databricks?

prapot
New Contributor II

import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder()
  .master("local[3]")
  .appName("SparkByExamples.com")
  .getOrCreate()

// Read the CSV file into a DataFrame
val df = spark.read.option("header", true).csv("address.csv")

// Write the DataFrame to the address directory
df.write.csv("address")

The write statement above produces 3 CSV part files plus .CRC and _SUCCESS files.

Is there any option in Spark not to write these files? I found an article that explains how to remove them after writing (https://sparkbyexamples.com/spark/spark-write-dataframe-single-csv-file/), but I can't use that approach for several reasons.

Hope the question is clear, and looking forward to an answer here.

Appreciate it.

1 ACCEPTED SOLUTION

Accepted Solutions

-werners-
Esteemed Contributor III

spark.conf.set("spark.sql.sources.commitProtocolClass", "org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol")

spark.conf.set("parquet.enable.summary-metadata", "false")

spark.conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

These parameters avoid writing the metadata files.

The fact that you have multiple CSV files is the result of parallel processing. If you do not want that, you will have to add coalesce(1) to your write statement.

But that will impact the performance of your Spark code.
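Putting that together, here is a minimal sketch of the full write (assuming the SparkSession and DataFrame from the question; the output path "address" and the overwrite mode are just for illustration):

// Disable the extra metadata files
spark.conf.set("spark.sql.sources.commitProtocolClass", "org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol")
spark.conf.set("parquet.enable.summary-metadata", "false")
spark.conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

// Optional: coalesce(1) forces a single CSV part file,
// at the cost of funneling all data through one task
df.coalesce(1)
  .write
  .option("header", true)
  .mode("overwrite")
  .csv("address")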


2 REPLIES


Nw2this
New Contributor II

Will your CSV have the name prefix 'part-', or can you name it whatever you like?
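For what it's worth, Spark itself always names its output files with the part- prefix; the CSV writer has no option to pick the file name. A common workaround, sketched here under the assumption of a Databricks notebook (so dbutils is available), a coalesce(1) write, and purely illustrative paths, is to rename the single part file after the write:

// Find the single part file Spark wrote and move it to a name of our choosing
// (both paths below are hypothetical)
val outputDir = "dbfs:/tmp/address"
val partFile = dbutils.fs.ls(outputDir)
  .map(_.path)
  .filter(_.contains("part-"))
  .head

dbutils.fs.mv(partFile, "dbfs:/tmp/address.csv")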
