4 weeks ago
I want to write a file from a notebook to blob storage. We have configured Unity Catalog. When it writes, it creates a folder with the file name that I provided, and inside that folder it writes multiple files, as shown below. Can someone suggest how I can write it as a single file?
_committed_3484505682152580967
_started_3484505682152580967
_SUCCESS
part-00000-tid-3484505682152580967-77c7321e-c7f1-4194-b5f2-e5194aa2eb52-740-1-c000.csv
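For context, a write along these lines reproduces that folder layout; this is a minimal sketch, not the original notebook code, and the abfss URI is a placeholder:
# Sketch: any DataFrame.write to a path creates a directory with that name,
# containing Spark's marker files plus one part-*.csv per partition.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
output_path = "abfss://mycontainer@mystorageaccount.dfs.core.windows.net/demo/my_file.csv"  # placeholder path
df.write.mode("overwrite").option("header", "true").csv(output_path)
# dbutils.fs.ls(output_path) would then show _SUCCESS, _committed_*, _started_* and part-* files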
4 weeks ago
Hello @Shivap,
Spark has no write option that produces a single named file directly. By default it writes in a distributed manner, so every write creates a directory containing _committed, _started, and _SUCCESS marker files plus one part file per partition. To end up with a single file, you reduce the DataFrame to one partition, write it to a temporary directory, and then copy the resulting part file to the name you want.
Here are the steps to ensure you write a single file:
1. Use coalesce(1) to reduce the DataFrame to a single partition.
2. Write the DataFrame to a temporary directory.
3. Locate the single part file in that directory and copy it to the final path with the file name you want.
4. Remove the temporary directory.
Here is an example using these steps:
# Example DataFrame
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "value"])

# Write DataFrame to a temporary directory
temp_path = "dbfs:/tmp/output"
(df.coalesce(1)               # Reduce to one partition
   .write
   .mode('overwrite')
   .option('header', 'true')
   .csv(temp_path))

# List the files in the temporary directory and find the single part file
files = dbutils.fs.ls(temp_path)
part_file = [file.path for file in files if file.name.startswith("part")][0]

# Define the final output path
single_file_path = "dbfs:/path/to/final_output.csv"

# Copy the part file to the final output path
dbutils.fs.cp(part_file, single_file_path)

# Clean up the temporary directory
dbutils.fs.rm(temp_path, True)
This approach ensures that the data is written to a single file named final_output.csv in your specified location.
Remember, this technique might not be optimal for very large datasets, because all of the data has to be collected into a single partition. For large datasets, consider keeping the default partitioned output and handling the multiple part files on the consuming side.
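For example, if the consumer also reads with Spark, the folder of part files already behaves as one dataset. A quick sketch, assuming the multi-part CSV output above is left in place at temp_path instead of being copied and cleaned up:
# Sketch: Spark reads the whole directory of part-*.csv files as a single
# DataFrame, so no manual merging is needed on the consuming side.
df_back = (spark.read
           .option("header", "true")
           .csv("dbfs:/tmp/output"))   # same temp_path as in the example above
print(df_back.count())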
4 weeks ago - last edited 4 weeks ago
Hi Shivap
If you want to save a DataFrame as a single file, you could consider converting the PySpark DataFrame to a pandas DataFrame and then saving it with pandas.
import os

path_single_file = '/Volumes/demo/raw/test/single'

# create sample dataframe
df = spark.createDataFrame(
    [(i, f'name_{i}', i * 10, i % 2 == 0, f'2025-02-{i+1:02d}') for i in range(1, 21)],
    ['id', 'name', 'value', 'is_even', 'date']
)

# convert df to pandas dataframe
pdf = df.toPandas()

# create folder, if it does not exist
if not os.path.exists(path_single_file):
    os.makedirs(path_single_file)

# write a single parquet file into the Volume
pdf.to_parquet(f'{path_single_file}/my_df.parquet', index=False)
When I then look into the directory, it contains just the single file my_df.parquet, with no _SUCCESS, _committed, or _started marker files.
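If you want to verify this from the notebook rather than the UI, something along these lines should work; a sketch, assuming the same Volume path as above and that dbutils is available:
# Sketch: list the Volume folder - it should show only my_df.parquet
for f in dbutils.fs.ls(path_single_file):
    print(f.name, f.size)

# Read the single file back with Spark to confirm it is usable downstream
spark.read.parquet(f'{path_single_file}/my_df.parquet').show(5)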