08-24-2022 12:59 AM
When I use the following code:
(df
  .coalesce(1)
  .write.format("com.databricks.spark.csv")
  .option("header", "true")
  .save("/path/mydata.csv"))
it writes several files, and when used with .mode("overwrite"), it will overwrite everything in the folder.
I want to add a single .csv file to the folder in the blob, without overwriting the content of the path.
Also, I am not sure how to name the file: when you specify mydata.csv, it creates a folder with that name and several files inside it.
08-24-2022 07:36 AM
As far as I know this will always create a folder with distributed data.
You can try pandas to_csv instead; note that pandas writes through the /dbfs FUSE mount, so use a /dbfs/... local-style path rather than a dbfs:/ URI:
df.toPandas().to_csv("/dbfs/mnt/azurestorage/filename.csv")
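A slightly fuller sketch of that approach, assuming the storage account is mounted at /mnt/azurestorage (the mount name is illustrative):

# toPandas() collects the whole DataFrame onto the driver,
# so this only suits data that fits in driver memory.
pdf = df.toPandas()
# index=False keeps pandas' row index out of the CSV.
pdf.to_csv("/dbfs/mnt/azurestorage/filename.csv", index=False)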
08-24-2022 07:38 AM
Thank you.
09-02-2022 12:21 AM
Hi @Daniel Gießing , I was checking back to see if the suggestions helped you.
If you have found a solution, please share it with the community, as it can be helpful to others.
Also, please don't forget to click the "Select As Best" button whenever the information provided helps resolve your question.
09-02-2022 12:23 AM
Thank you, Fatma, the issue has been resolved 🙂
12-01-2023 03:05 AM
As per usual, a random YouTuber with no Azure or Databricks affiliation has to step in and tell us what to do:
Don't use the pandas method if you want to write to an ABFSS endpoint, as it's not supported in Databricks. It can also cause memory overload issues, since it uses a single worker instead of distributing the work.
Essentially, you need to land the output in a temp folder, loop through the files it contains, pick out your target file (which has an unhelpful system-generated name), use dbutils.fs.cp to copy it under the name you actually want to the folder where you actually want it, and then delete all the Databricks-generated fluff you don't need.
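A minimal sketch of that workflow, for reference. The paths and the tmp_dir/target_path names are illustrative, and it assumes you're running in a Databricks notebook where dbutils is available:

# 1. Write the single-partition output to a temporary folder.
tmp_dir = "dbfs:/mnt/azurestorage/_tmp_csv_out"    # hypothetical temp location
target_path = "dbfs:/mnt/azurestorage/mydata.csv"  # the file name you actually want

(df.coalesce(1)
   .write.format("csv")
   .option("header", "true")
   .mode("overwrite")
   .save(tmp_dir))

# 2. Find the system-generated part file inside the temp folder.
part_file = [f.path for f in dbutils.fs.ls(tmp_dir) if f.name.startswith("part-")][0]

# 3. Copy it to the destination under the name you want.
dbutils.fs.cp(part_file, target_path)

# 4. Remove the temp folder, and the _SUCCESS/_committed fluff with it.
dbutils.fs.rm(tmp_dir, True)

Because only the temp folder is ever overwritten, the existing contents of the destination folder are left untouched.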
TBH, this is quite an involved process for such a bread-and-butter engineering task, which is surprising. We have analysts in our business who aren't necessarily well versed in PySpark and are frankly going to really struggle with this. The fact that you have to scour the internet for a random YouTuber to find the resolution is disappointing, considering Azure and Databricks should provide clear instructions if this is how their (expensive) systems work.
01-17-2024 04:37 AM
Hi Daniel,
May I know how you fixed this issue? I am facing a similar issue while writing CSV/Parquet to Blob/ADLS: it creates a separate folder with the file name and writes a partition file within that folder.
I need to write just a single file to Blob/ADLS. Your response would be helpful.
Thanks in advance.