
Writing data to a .csv file (in the Databricks Free Edition)

DanielW
New Contributor III

 

I have been testing out a notebook. Is it possible to write tabular output to a .csv file, and if so, should this go to the filesystem or an S3 bucket?

I get errors when I try either approach with the snippet below.

# Sort by concert date to get the most recent concert
concerts_df = concerts_df.sort_values(by='concert_date', ascending=False)

# Ensure the directory exists using Databricks utilities
output_dir = '/dbfs/tmp'
dbutils.fs.mkdirs(output_dir)

# Write the DataFrame to a CSV file
#concerts_df.to_csv(f'{output_dir}/blacksabbath_concerts.csv', index=False)

# Define the S3 bucket and file path
#bucket_name = 'my-s3-bucket'
#file_path = f's3://{bucket_name}/blacksabbath_concerts.csv'

display(concerts_df)

 

 

 

1 ACCEPTED SOLUTION

ilir_nuredini
Honored Contributor

Hello @DanielW 

DBFS (which your output_dir variable points to) is now considered a legacy approach; the recommended way to store and access data files going forward is Unity Catalog (UC) Volumes. Note that DBFS is disabled in the Free Edition. See below for how you can use a UC Volume to work with files, using CSV as an example.

Example: writing to a UC Volume using Python.

1. Using pandas to save a CSV file with example data:

import pandas as pd

volume_path = "/Volumes/workspace/default/temp/output.csv"

# Example data
df = pd.DataFrame([
    ["Ilir", 30],
    ["Nuredini", 25]
], columns=["name", "age"])

# Save to a Unity Catalog volume path
df.to_csv(volume_path, index=False)

2. Using with open():

import csv

volume_path = "/Volumes/workspace/default/temp/output2.csv"

rows = [["name", "age"], ["Ilir2", 30], ["Nuredini2", 25]]

# Write CSV using `with open`
with open(volume_path, mode="w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerows(rows)

 

Here is an example of how to read a CSV file from a Volume:

df = spark.read.csv("/Volumes/workspace/default/temp/output2.csv", header=True, inferSchema=True)
df.show()
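And if you prefer to stay in Spark for the write as well, here is a minimal sketch (the output path is just an example; note that Spark writes a directory of part files rather than a single CSV file):

spark_df = spark.createDataFrame(
    [("Ilir", 30), ("Nuredini", 25)],
    ["name", "age"]
)

# Spark writes a directory of part files, not a single CSV file
(spark_df.write
    .mode("overwrite")
    .option("header", True)
    .csv("/Volumes/workspace/default/temp/output_spark"))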

 

Hope that helps. Let me know if you need a more specific scenario.

Best, Ilir


4 REPLIES


Is there no PySpark syntax to write a file as CSV? It was so easy to write in Community Edition before.

DanielW
New Contributor III

Hi @ilir_nuredini ,

Thanks very much for the advice, as this really helped, especially the point that DBFS is not available on the Free Edition.

I ran into an error and realised I needed to create a volume and schema first. So, with a little help from the AI assistant:

-- Create a catalog
CREATE CATALOG my_catalog;

-- Create a schema within the catalog
CREATE SCHEMA my_catalog.my_schema;

-- Create a volume within the schema
CREATE VOLUME my_catalog.my_schema.my_volume;

Then the necessary permissions, just for me at the moment:

%sql
-- Grant privileges
GRANT USE CATALOG ON CATALOG my_catalog TO `your_user_or_group`;
GRANT USE SCHEMA ON SCHEMA my_catalog.my_schema TO `your_user_or_group`;
GRANT WRITE VOLUME ON VOLUME my_catalog.my_schema.my_volume TO `your_user_or_group`;

 

Thereafter, I can reference the volume I need to write to:

#create the volume path
filename = "output.csv"
volume_path = "/Volumes/my_catalog/my_schema/my_volume/" + filename

# then write my output
concert_output_df = pd.DataFrame(concerts_df[['concert_id',bandname,......
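The complete write then looks something like this (the column names here are hypothetical stand-ins, since my actual ones are trimmed above):

# Hypothetical column names, standing in for the actual concert columns
concert_output_df = pd.DataFrame(
    concerts_df[['concert_id', 'band_name', 'concert_date']]
)

# Write the selected columns to the UC Volume as a single CSV
concert_output_df.to_csv(volume_path, index=False)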

 

I am not sure of the naming conventions I should use for creating volumes and schemas, so I would welcome any advice. I will try out the other snippets you suggested, as I am keen to create a few tables with these output files. Thanks again.
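For the tables, I am planning something along these lines, reusing the catalog and schema created above (the table name is a placeholder):

# Read the CSV back from the Volume and save it as a managed table
df = spark.read.csv(
    "/Volumes/my_catalog/my_schema/my_volume/output.csv",
    header=True,
    inferSchema=True
)
df.write.mode("overwrite").saveAsTable("my_catalog.my_schema.concerts")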

Hello @DanielW ,

Glad it helped. While there is no single way to organize UC that fits every organization, I would highly recommend this article, which is super helpful for deciding which naming convention fits your scenario best:

https://bpcs.com/blog/unity-catalog-an-easy-guide-to-naming-conventions
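As one illustration, a common pattern is an <environment>_<domain> style for catalogs, with domain-based schemas underneath (these names are made up for the example, not taken from the article):

-- Hypothetical names illustrating an <environment>_<domain> convention
CREATE CATALOG dev_music;
CREATE SCHEMA dev_music.concerts;
CREATE VOLUME dev_music.concerts.raw_files;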

If the reply was helpful, it would be great if you could accept it as a solution so fellow colleagues can benefit from this too. Thank you!

Best, Ilir 
