02-17-2025 04:47 PM
I want to write a file from a notebook to blob storage. We have configured Unity Catalog. When it writes, it creates a folder named after the file name I provided, and inside that folder it writes multiple files, as shown below. Can someone suggest how I can write it as a single file?
_committed_3484505682152580967
_started_3484505682152580967
_SUCCESS
part-00000-tid-3484505682152580967-77c7321e-c7f1-4194-b5f2-e5194aa2eb52-740-1-c000.csv
Accepted Solutions
02-17-2025 05:49 PM
Hello @Shivap,
To write the output as a single file, you need to work around Spark's default write behavior. By default, Spark writes the data in a distributed manner, producing one part file per partition plus marker files such as _committed, _started, and _SUCCESS inside a directory.
Here are the steps to ensure you write a single file:
- Configure the Output Path: Specify the exact file path where you want the single file to be written.
- Use Coalesce or Repartition: If you're working with a DataFrame, use coalesce(1) to collect all data into one partition before writing. This will force Spark to write the data out as a single file.
- Save the File with dbutils.fs.cp: Write the DataFrame to a temporary path, then use dbutils.fs.cp to copy the resultant part file to the desired single file path.
Here is an example using these steps:
# Example DataFrame
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "value"])
# Write DataFrame to a temporary directory
temp_path = "dbfs:/tmp/output"
(df.coalesce(1)               # Reduce to one partition
   .write
   .mode('overwrite')
   .option('header', 'true')
   .csv(temp_path))
# List the files in the temporary directory
files = dbutils.fs.ls(temp_path)
part_file = [file.path for file in files if file.name.startswith("part")][0]
# Define the final output path
single_file_path = "dbfs:/path/to/final_output.csv"
# Copy the part file to the final output path
dbutils.fs.cp(part_file, single_file_path)
# Clean up temporary directory
dbutils.fs.rm(temp_path, True)
This approach ensures that the data is written to a single file named final_output.csv in your specified location.
Remember, this technique might not be optimal for very large datasets because coalesce(1) forces all the data through a single partition. For large datasets, consider keeping an appropriate partitioning strategy and handling the multiple part files on the consuming side, for example by reading the whole output directory back as one DataFrame (see the sketch below).
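As an illustration of the consuming-side option, here is a minimal sketch: Spark can read the whole output directory back as one DataFrame, so the individual part files never need to be merged by hand. The directory path and the header/inferSchema options are assumptions for this example, not taken from the thread.
# Minimal sketch (assumed path): read a directory of part files back as one DataFrame
output_dir = "dbfs:/path/to/output_dir"  # hypothetical directory produced by df.write.csv(...)
df_all = (spark.read
          .option("header", "true")       # the example above wrote a header row
          .option("inferSchema", "true")
          .csv(output_dir))
# Files whose names start with an underscore (_SUCCESS, _committed_*, _started_*) are ignored by the reader
df_all.show()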
02-17-2025 09:55 PM - edited 02-17-2025 09:59 PM
Hi Shivap,
If you want to save a DataFrame as a single file, you could consider converting the PySpark DataFrame to a pandas DataFrame and then saving it as a file.
path_single_file = '/Volumes/demo/raw/test/single'
# create sample dataframe
df = spark.createDataFrame(
    [(i, f'name_{i}', i*10, i%2 == 0, f'2025-02-{i+1:02d}') for i in range(1, 21)],
    ['id', 'name', 'value', 'is_even', 'date']
)
# convert df to pandas dataframe
pdf = df.toPandas()
# create folder, if not exists
import os
if not os.path.exists(path_single_file):
    os.makedirs(path_single_file)
pdf.to_parquet(f'{path_single_file}/my_df.parquet', index=False)
When I then look into the directory, it contains just the single my_df.parquet file.
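If a single CSV is needed instead of Parquet (as in the original question), the same pandas approach could look like this minimal sketch, which reuses the df from the example above; the Volume path and file name are assumptions for illustration.
import os
# Minimal sketch (assumed path): write one CSV file via pandas into a Unity Catalog Volume
csv_dir = '/Volumes/demo/raw/test/single_csv'   # hypothetical Volume folder
os.makedirs(csv_dir, exist_ok=True)
# toPandas() collects all rows to the driver, so this suits data that fits in driver memory
pdf = df.toPandas()
pdf.to_csv(f'{csv_dir}/my_df.csv', index=False)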