Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Writing back from a notebook to blob storage as a single file with Unity Catalog-configured Databricks

Shivap
New Contributor III

I want to write a file from a notebook to blob storage. We have Unity Catalog configured. When it writes, it creates a folder named after the file name I provided and writes multiple files inside it, as shown below. Can someone suggest how I can write it as a single file?

_committed_3484505682152580967
_started_3484505682152580967
_SUCCESS
part-00000-tid-3484505682152580967-77c7321e-c7f1-4194-b5f2-e5194aa2eb52-740-1-c000.csv

 

2 ACCEPTED SOLUTIONS

Alberto_Umana
Databricks Employee

Hello @Shivap,

The multiple files you see (_committed, _started, _SUCCESS, and the part file) appear because Spark writes output in a distributed manner by default and treats the path you provide as a directory. To end up with a single file, reduce the DataFrame to one partition before writing, then copy the resulting part file to the exact path you want.

Here are the steps to ensure you write a single file:

  1. Configure the Output Path: Specify the exact file path where you want the single file to be written.
  2. Use Coalesce or Repartition: If you're working with a DataFrame, use coalesce(1) to collect all data into one partition before writing. This forces Spark to write the data out as a single part file (the directory will still contain _SUCCESS and commit marker files, which is why step 3 copies the part file out).
  3. Save the File with dbutils.fs.cp: Write the DataFrame to a temporary path, then use dbutils.fs.cp to copy the resultant part file to the desired single file path.

Here is an example using these steps:

# Example DataFrame
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "value"])

# Write DataFrame to a temporary directory
temp_path = "dbfs:/tmp/output"

(df.coalesce(1)              # Reduce to one partition
   .write
   .mode('overwrite')
   .option('header', 'true')
   .csv(temp_path))

# List the files in the temporary directory and pick out the part file
files = dbutils.fs.ls(temp_path)
part_file = [file.path for file in files if file.name.startswith("part")][0]

# Define the final output path
single_file_path = "dbfs:/path/to/final_output.csv"

# Copy the part file to the final output path
dbutils.fs.cp(part_file, single_file_path)

# Clean up the temporary directory
dbutils.fs.rm(temp_path, True)

This approach ensures that the data is written to a single file named final_output.csv in your specified location.

Remember, this technique might not be optimal for very large datasets due to the overhead of collecting data to a single partition. For large datasets, consider using appropriate partitioning strategies or handling multiple part files appropriately on the consuming side.
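
If you do keep the default multi-part output for larger datasets, the consuming side can usually read the whole directory as one dataset rather than a single file. A minimal sketch, assuming an illustrative directory that contains the part files:

# Sketch: reading all part files in an output directory as one DataFrame
multi_part_path = "dbfs:/path/to/multi_part_output"  # illustrative directory containing part-* files

df_all = (spark.read
    .option("header", "true")   # match the header option used when writing
    .csv(multi_part_path))      # Spark treats every part file in the directory as one dataset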


Stefan-Koch
Valued Contributor II

Hi Shivap

If you want to save a DataFrame as a single file, you could consider converting the PySpark DataFrame to a pandas DataFrame and then saving it as a file.

 

path_single_file = '/Volumes/demo/raw/test/single'

# create sample dataframe
df = spark.createDataFrame(
    [(i, f'name_{i}', i*10, i%2 == 0, f'2025-02-{i+1:02d}') for i in range(1, 21)],
    ['id', 'name', 'value', 'is_even', 'date']
)

# convert df to pandas dataframe
pdf = df.toPandas()

# create folder, if not exists
import os
if not os.path.exists(path_single_file):
    os.makedirs(path_single_file)

pdf.to_parquet(f'{path_single_file}/my_df.parquet', index=False)

 

When I then look into the directory, it looks like this:

[screenshot: directory listing showing the single file my_df.parquet]
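
Since the original question was about a CSV file, the same pattern also works with pandas' to_csv. A minimal sketch, reusing df and path_single_file from the example above:

# Sketch: writing the same DataFrame as a single CSV file into the Volume
pdf = df.toPandas()                                        # collect the Spark DataFrame to the driver
pdf.to_csv(f'{path_single_file}/my_df.csv', index=False)   # pandas writes exactly one CSV file

As with coalesce(1), this collects all rows to the driver, so it is only practical for data that fits in driver memory.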

 

 

