Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Simply writing a dataframe to a CSV file (non-partitioned)

Bilal1
New Contributor III

When writing a DataFrame to a CSV file in PySpark, a folder is created containing a partitioned CSV file. I then have to rename this file in order to distribute it to my end user.

Is there any way I can simply write my data to a CSV file with the name I specified, and have that single file in the folder I specified?

1 ACCEPTED SOLUTION

Accepted Solutions

-werners-
Esteemed Contributor III

It will always write to a folder due to the parallel nature of Spark.

If that is an issue, you can use the magic command %sh to move the .csv file up a level and also rename it.

So use the 'mv' command.
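Concretely, the move-and-rename step looks like this (a minimal sketch; the Spark output folder is simulated here with mktemp, and all paths and file names are illustrative placeholders, not real DBFS paths):

```shell
# Simulate the folder that Spark leaves behind after
# df.coalesce(1).write.csv(...) - it contains one part file.
out=$(mktemp -d)/spark_out                        # stands in for the output folder
mkdir -p "$out"
echo "col1,col2" > "$out/part-00000-abc123.csv"   # the single part file

# The actual fix: move the part file up a level and give it a friendly name
mv "$out"/part-*.csv "$(dirname "$out")/report.csv"
rm -r "$out"                                      # remove the now-empty folder
```

In a Databricks notebook the same two commands would run in a `%sh` cell against `/dbfs/...` paths, or via `dbutils.fs.mv` as shown later in the thread.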


7 REPLIES

-werners-
Esteemed Contributor III

Yes, but you have to do a coalesce(1). This will generate a single CSV file; however, you will also lose some parallelism, as the coalesce(1) is propagated upstream.

Also, do not forget to disable the writing of the _SUCCESS etc. files (see this topic)
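For reference, the _SUCCESS marker file comes from Hadoop's output committer and can be suppressed with a Spark configuration setting before the write (a sketch; shown in Scala to match the code later in the thread):

```scala
// Suppress the _SUCCESS marker file that Hadoop's FileOutputCommitter
// writes alongside the CSV part files. Set this before calling df.write.
spark.conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
```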

Bilal1
New Contributor III

Thanks Werners. However, it still writes to a folder, and I still need to rename the file, copy it out, etc.

I would like test1.csv to be a file in the root folder. Not a folder.


krutarth
New Contributor II

The CSV file will have a random name. Can you show me how to rename it without the hassle of copying its name?

For example, let's say the name of the root folder is Main. Inside Main I wrote a CSV using coalesce(1), and the resulting structure is Main/data.csv/RandomBigName-part-00000xyz.csv.

Now I want to move the CSV file into the Main folder and, let's say, name it dummyData.csv. So the final structure I want is Main/dummyData.csv.

Please help

Nw2this
New Contributor II

Could you please provide an example of using %sh or mv to move and rename the CSV?

Bilal1
New Contributor III

Thanks for confirming that that's the only way 🙂

chris0706
New Contributor II

I know this post is a little old, but ChatGPT actually put together a very clean and straightforward solution for me (in Scala):

 

// Set the temporary output directory and the desired final file path
val tempDir = "/tmp/your_file_name"
val finalOutputPath = "/tmp/your_file_name.csv"
 
// Get a DataFrame that contains the relevant CSV file data
val df = spark.table("your_table_name")

// Write DataFrame to a single partition in the temporary directory
df.coalesce(1)
  .write
  .mode("overwrite")
  .option("header", "true")
  .csv(tempDir)

// List the files in the temporary directory to find the CSV file
val csvFile = dbutils.fs.ls(tempDir).filter(file => file.name.endsWith(".csv"))(0).path

// Move and rename the CSV file to the desired location
dbutils.fs.mv(csvFile, finalOutputPath)

// Remove the temporary directory
dbutils.fs.rm(tempDir, true)
