How can I write a single file to blob storage from a Python notebook, into a folder with other data?

Danielsg94
New Contributor II

When I use the following code:

(df
   .coalesce(1)
   .write.format("com.databricks.spark.csv")
   .option("header", "true")
   .save("/path/mydata.csv"))

it writes several files, and when used with .mode("overwrite"), it will overwrite everything in the folder.

I want to add a single .csv file to the folder in the blob, without overwriting the content of the path.

Also, I am not sure how to name the file: when you specify mydata.csv, it creates a folder with that name and several files inside it.


6 REPLIES

RRO
Contributor (Accepted Solution)

As far as I know, this will always create a folder with distributed part files.

You can try pandas' to_csv instead (note that pandas needs the local FUSE path, /dbfs/..., rather than the dbfs:/ URI):

df.toPandas().to_csv("/dbfs/mnt/azurestorage/filename.csv")
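
A slightly fuller sketch of the same idea, with the caveats spelled out (assuming /mnt/azurestorage is an existing DBFS mount):

# toPandas() collects every row onto the driver, so this only suits
# data that fits in driver memory.
pdf = df.toPandas()

# pandas writes through the local filesystem, hence the /dbfs FUSE path;
# index=False keeps the pandas row index out of the file.
pdf.to_csv("/dbfs/mnt/azurestorage/filename.csv", index=False)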

Danielsg94
New Contributor II

Thank you.

Kaniz
Community Manager

Hi @Daniel Gießing, I was checking back to see if RRO's suggestion helped you.

If it did not, and you have found another solution, please share it with the community, as it can be helpful to others.

Also, please don't forget to click the "Select As Best" button whenever the information provided helps resolve your question.

Danielsg94
New Contributor II

Thank you, Fatma, the issue has been resolved 🙂

Lukeh
New Contributor II

As per usual, a random YouTuber unaffiliated with Azure or Databricks needs to step in and tell us what to do:

6. How to Write Dataframe as single file with specific name in PySpark | #spark#pyspark#databricks -...

Don't use the pandas method if you want to write directly to an ABFSS endpoint, as that isn't supported in Databricks. It can also cause memory overload issues, since it collects everything onto the driver instead of distributing the work.

Essentially, you need to land the output in a temp folder, loop through the files it contains, rename your target file from the unhelpful system-generated name to what you actually want it called, use dbutils.fs.cp to copy it into the folder where you actually want the file saved, and then delete all the system-generated fluff that you don't need (see the sketch below).
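
A rough sketch of that workflow (dbutils is available in Databricks notebooks; both paths below are hypothetical placeholders, assuming the container is mounted under /mnt/azurestorage):

tmp_dir = "dbfs:/mnt/azurestorage/_tmp_single_csv"    # hypothetical scratch folder
target = "dbfs:/mnt/azurestorage/mydata/mydata.csv"   # hypothetical final path

# 1. Land the output as a folder of part files.
(df.coalesce(1)
   .write.format("csv")
   .option("header", "true")
   .mode("overwrite")
   .save(tmp_dir))

# 2. Find the single system-generated part file.
part = [f.path for f in dbutils.fs.ls(tmp_dir) if f.name.startswith("part-")][0]

# 3. Copy it into the real folder under the name you want;
#    nothing else in that folder is touched.
dbutils.fs.cp(part, target)

# 4. Delete the temp folder and the _SUCCESS/_committed fluff with it.
dbutils.fs.rm(tmp_dir, True)  # True = recursive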

TBH, this is quite an involved process for such a bread-and-butter engineering task, which is surprising. We have analysts in our business who aren't necessarily well versed in PySpark and are frankly gonna really struggle doing this. The fact that you have to scour the internet for a random YouTuber to find the resolution is disappointing, considering Azure and Databricks should provide clear instructions if this is how their (expensive) systems work.

Simha
New Contributor II

Hi Daniel,

May I know how you fixed this issue? I am facing a similar issue when writing CSV/Parquet to blob/ADLS: it creates a separate folder with the filename and a partition file within that folder.

I need to write just a single file to the blob/ADLS. Your response will be helpful.

Thanks in advance.
