Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Write file as CSV format

pop_smoke
New Contributor

Is there any simple PySpark syntax to write data in CSV format to a file (or anywhere) in the Free Edition of Databricks? In Community Edition it was so easy.

 


8 REPLIES

BS_THE_ANALYST
Esteemed Contributor

@pop_smoke a typical solution would be to store the CSVs in a Volume within your Unity Catalog in the Free Edition.

Syntax used for writing

One example:

df.write.format("csv").mode("overwrite").save("/Volumes/workspace/default/volume_files/media_customer_reviews")

Another Example:

df.write.csv("/Volumes/workspace/default/volume_files/media_customer_reviews", header=True, mode="overwrite")

Official docs for syntax: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWrite... 
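
If you want to sanity-check what actually landed in the Volume, you can list the directory with dbutils. A quick sketch, reusing the same example path as above:

# List the files Spark wrote into the Volume directory (same example path as above)
for f in dbutils.fs.ls("/Volumes/workspace/default/volume_files/media_customer_reviews"):
    print(f.name, f.size)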

All the best,
BS



emp_filter_age = emp_filtered_1.select("emp_id", "name", "salary", "age").where("age > 30")
display(emp_filter_age)

emp_filter_age.coalesce(1).write.format("csv").mode("ignore").option("header", "true").save("/Volumes/workspace/default/volume_file/age_greater_30")

I am using this approach, but the problem is that I have to create a new directory under volume_file every time I write a different file. We can name the directory, but if we are collecting the data as a single CSV file (a single partition), is there any way to give that particular file the name we want inside the directory?
 

@pop_smoke that's a great question. If I'm honest, I'm not actually sure whether you can control the name of the underlying CSV when writing out from a PySpark dataframe. I'm not saying this is best practice, but I think you could pretty much write it out and then rename it afterwards 🤔😂. Happy for other community members to show me otherwise, always willing to learn ☺️

dbutils has a bunch of cool stuff: https://docs.databricks.com/aws/en/dev-tools/databricks-utils. One of the things it can do is move/copy/rename/delete files; it's pretty similar to the "shutil" and "os" modules in standard Python.

So, if we look at what gets written out, the target directory contains one or more part-*.csv files (plus marker files such as _SUCCESS).

If we target that parent directory, we can rename all of the .csv files within it. One possible way would be to use a for loop with a counter: it iterates over each of the files, renames it, and the counter increases to provide a unique index for the next one. We'll end up with something like {file_name}_1.csv, {file_name}_2.csv, {file_name}_3.csv. Remember, you could have many .csvs in your directory depending on the partitions, so a loop and a rename works here. Again, @pop_smoke, I'm not sure this is best practice by any means 😂.

This is the code, prepped and ready to iterate through and rename:

target_directory = "/Volumes/workspace/default/volume_files/media_customer_reviews"
new_file_name_prefix = "media_customer_reviews"

# Rename each part file, using a counter to keep the new names unique
i = 1
for file in dbutils.fs.ls(target_directory):
    if file.name.startswith("part-"):
        dbutils.fs.mv(file.path, target_directory + "/" + new_file_name_prefix + str(i) + ".csv")
        i += 1

The result: the part-*.csv files in the directory are renamed with the new prefix.

I guess, if you wanted to, you could also remove the other files in that directory (e.g. the _SUCCESS marker). dbutils can do that too. I'll leave that one to you 🤔😂.
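
As a variation on the loop above: if you coalesce to a single partition before writing (as you did earlier in the thread), there is exactly one part-*.csv in the directory, so you can move it to whatever exact filename you like. A rough sketch along those lines, reusing the example paths from this thread, and again not claiming it's best practice:

# Write as a single partition, then rename the lone part file to an exact name
target_directory = "/Volumes/workspace/default/volume_files/media_customer_reviews"

df.coalesce(1).write.format("csv").mode("overwrite").option("header", "true").save(target_directory)

# Only one file starts with "part-" here, so this renames just that file
for file in dbutils.fs.ls(target_directory):
    if file.name.startswith("part-"):
        dbutils.fs.mv(file.path, target_directory + "/media_customer_reviews.csv")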

@pop_smoke solutions, in the community world, are like liquid gold. Only use them for the posts that solve your problem; this puts a higher value on them for the person who receives them. Liking a post is just as good ☺️. Feel free to remove them from any of my previous posts that didn't answer your problem.

All the best,
BS

BS_THE_ANALYST
Esteemed Contributor

@pop_smoke you also have the option to just write it out directly as a single CSV. This does involve converting it to a pandas DataFrame, though. It just depends on your use case ☺️.

Syntax

# Convert to Pandas and save locally (good for small DataFrames)
df.toPandas().to_csv("/Volumes/workspace/default/volume_files/media_customer_reviews_single.csv", index=False)



All the best,
BS

Thank you so much, I did it now, but it is not showing up as just one part; it has created a part for every row. Is there anything I can do? You made a folder media_customer_reviews; is it necessary to make a folder every time we write a new file?

@pop_smoke the reason for that is that you're using PySpark (distributed compute) vs pandas (typically non-distributed).

With big data processing engines like Spark, the work is normally distributed across many computers (nodes/workers). When you write files out with big data, they're typically written out in partitions, i.e. many files, and it's easier to keep those contained in a directory. Whether it's a single CSV or many CSVs, writing one or more files into a single directory is just the scalable approach. You may find yourself with many files when writing out because of the default number of partitions created when you create a Spark dataframe. This is something you can alter, I believe. Have a Google or ask ChatGPT about the default number of partitions when writing out from a Spark dataframe; it'll be a good read.
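
If you want to see this for yourself, you can check how many partitions a DataFrame has and reduce the count before writing. A small sketch using standard PySpark calls, with df standing in for your own dataframe:

# Each partition becomes its own part file when you write out
print(df.rdd.getNumPartitions())

# Reducing to a single partition means a single part-*.csv in the output directory
df_single = df.coalesce(1)
print(df_single.rdd.getNumPartitions())  # 1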

# Convert to Pandas and save locally (good for small DataFrames)
df.toPandas().to_csv("/Volumes/workspace/default/volume_files/media_customer_reviews_single.csv", index=False)

 

If you want the "single CSV", use the pandas solution I provided above. Let me know if that works ☺️

All the best,
BS

I just added coalesce to the same syntax you provided me the first time, I did not use pandas, and I got the output in one CSV file. I am from an Ab Initio (old ETL software) background, so I was a little confused; we have multifile and serial file systems in Ab Initio.
Thank you!

BS_THE_ANALYST
Esteemed Contributor

@pop_smoke no worries! My background is with Alteryx (an ETL tool). I too am learning Databricks 😀

I look forward to seeing you in the forum ☺️. Please share any cool things you find or any projects you do 👍.

All the best,
BS
