How can I write a single file to blob storage from a Python notebook, into a folder with other data?

Danielsg94
New Contributor II

When I use the following code:

(df
   .coalesce(1)
   .write.format("com.databricks.spark.csv")
   .option("header", "true")
   .save("/path/mydata.csv"))

it writes several files, and when used with .mode("overwrite"), it will overwrite everything in the folder.

I want to add a single .csv file to the folder in the blob, without overwriting the content of the path.

Also, I am not sure how to name the file: when you specify mydata.csv, it creates a folder with that name and several files inside it.


6 REPLIES

RRO
Contributor (Accepted Solution)

As far as I know, this will always create a folder with distributed part files.

You can try pandas' to_csv instead (note that pandas needs the local FUSE path, /dbfs/..., rather than the dbfs:/ URI):

df.toPandas().to_csv("/dbfs/mnt/azurestorage/filename.csv")
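
A slightly fuller sketch of the same idea, with the caveats spelled out (assuming /mnt/azurestorage is an existing DBFS mount):

# toPandas() collects every row onto the driver, so this only suits
# data that fits in driver memory.
pdf = df.toPandas()

# pandas writes through the local filesystem, hence the /dbfs FUSE path;
# index=False keeps the pandas row index out of the file.
pdf.to_csv("/dbfs/mnt/azurestorage/filename.csv", index=False)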

Danielsg94
New Contributor II

Thank you.

Kaniz
Community Manager

Hi @Daniel Gießing, I was checking back to see if RRO's suggestion helped you.

If it did not, and you have found another solution, please share it with the community, as it can be helpful to others.

Also, please don't forget to click the "Select As Best" button whenever the information provided helps resolve your question.

Danielsg94
New Contributor II

Thank you, Fatma, the issue has been resolved 🙂

Lukeh
New Contributor II

As per usual, a random YouTuber unaffiliated with Azure or Databricks needs to step in and tell us what to do:

6. How to Write Dataframe as single file with specific name in PySpark | #spark#pyspark#databricks -...

Don't use the pandas method if you want to write directly to an ABFSS endpoint, as that isn't supported in Databricks. It can also cause memory overload issues, since it collects everything onto the driver instead of distributing the work.

Essentially, you need to land the output in a temp folder, loop through the files it contains, rename your target file from the unhelpful system-generated name to what you actually want it called, use dbutils.fs.cp to copy it into the folder where you actually want the file saved, and then delete all the system-generated fluff that you don't need (see the sketch below).
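
A rough sketch of that workflow (dbutils is available in Databricks notebooks; both paths below are hypothetical placeholders, assuming the container is mounted under /mnt/azurestorage):

tmp_dir = "dbfs:/mnt/azurestorage/_tmp_single_csv"    # hypothetical scratch folder
target = "dbfs:/mnt/azurestorage/mydata/mydata.csv"   # hypothetical final path

# 1. Land the output as a folder of part files.
(df.coalesce(1)
   .write.format("csv")
   .option("header", "true")
   .mode("overwrite")
   .save(tmp_dir))

# 2. Find the single system-generated part file.
part = [f.path for f in dbutils.fs.ls(tmp_dir) if f.name.startswith("part-")][0]

# 3. Copy it into the real folder under the name you want;
#    nothing else in that folder is touched.
dbutils.fs.cp(part, target)

# 4. Delete the temp folder and the _SUCCESS/_committed fluff with it.
dbutils.fs.rm(tmp_dir, True)  # True = recursive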

TBH, this is quite an involved process for such a bread-and-butter engineering task, which is surprising. We have analysts in our business who aren't necessarily well versed in PySpark and are frankly gonna really struggle doing this. The fact that you have to scour the internet for a random YouTuber to find the resolution is disappointing, considering Azure and Databricks should provide clear instructions if this is how their (expensive) systems work.

Simha
New Contributor II

Hi Daniel,

May I know how you fixed this issue? I am facing a similar issue when writing CSV/Parquet to blob/ADLS: it creates a separate folder with the filename and a partition file within that folder.

I need to write just a single file to the blob/ADLS. Your response will be helpful.

Thanks in advance.
