topic Re: Writing back from notebook to blob storage as single file with UC configured databricks in Data Engineering

Writing back from notebook to blob storage as single file with UC configured databricks

Shivap — Tue, 18 Feb 2025 00:47:51 GMT

I want to write a file from a notebook to blob storage. We have Unity Catalog configured. When the write runs, it creates a folder with the file name I provided and writes multiple files inside it, as shown below. Can someone suggest how I can write it as a single file?

_committed_3484505682152580967
_started_3484505682152580967
_SUCCESS
part-00000-tid-3484505682152580967-77c7321e-c7f1-4194-b5f2-e5194aa2eb52-740-1-c000.csv

Re: Writing back from notebook to blob storage as single file with UC configured databricks

Alberto_Umana — Tue, 18 Feb 2025 01:49:14 GMT

Hello @Shivap,

To end up with a single file, you need to control how Spark writes its output. By default, Spark treats the path you pass as a directory and writes the data there as part files, alongside marker files such as _committed, _started, and _SUCCESS, because each partition of the DataFrame is written in parallel by a separate task.
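To make the layout of that output folder concrete, here is a minimal sketch that splits a listing into data files and marker files. The helper name split_spark_output is hypothetical (it is not a Spark or Databricks API), and the listing simply reuses the file names from the question.

```python
def split_spark_output(names):
    """Separate data part files from marker files in a Spark output listing."""
    # Data lives in files named part-...; names starting with "_" are markers.
    data_files = [n for n in names if n.startswith("part-")]
    marker_files = [n for n in names if n.startswith("_")]
    return data_files, marker_files


# File names taken from the question above
listing = [
    "_committed_3484505682152580967",
    "_started_3484505682152580967",
    "_SUCCESS",
    "part-00000-tid-3484505682152580967-77c7321e-c7f1-4194-b5f2-e5194aa2eb52-740-1-c000.csv",
]

data_files, marker_files = split_spark_output(listing)
print("data:", data_files)
print("markers:", marker_files)
```

Only the part file holds rows; the other entries can be ignored by anything that consumes the folder.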

Here are the steps to ensure you write a single file:

Configure the Output Path: Specify the exact file path where you want the single file to be written.
Use Coalesce or Repartition: If you're working with a DataFrame, use coalesce(1) to collect all data into one partition before writing. This will force Spark to write the data out as a single file.
Save the File with dbutils.fs.cp: Write the DataFrame to a temporary path, then use dbutils.fs.cp to copy the resultant part file to the desired single file path.

Here is an example using these steps:

# Example DataFrame
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "value"])

# Write DataFrame to a temporary directory
temp_path = "dbfs:/tmp/output"
(df.coalesce(1)  # Reduce to one partition
    .write
    .mode('overwrite')
    .option('header', 'true')
    .csv(temp_path))

# List the files in the temporary directory
files = dbutils.fs.ls(temp_path)
part_file = [file.path for file in files if file.name.startswith("part")][0]

# Define the final output path
single_file_path = "dbfs:/path/to/final_output.csv"

# Copy the part file to the final output path
dbutils.fs.cp(part_file, single_file_path)

# Clean up temporary directory
dbutils.fs.rm(temp_path, True)

This approach ensures that the data is written to a single file named final_output.csv in your specified location.
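If the temporary and final locations are reachable through ordinary file paths (an assumption; for example a Unity Catalog volume under /Volumes), the same copy-and-clean-up step could be sketched with the Python standard library instead of dbutils. The function name promote_single_part_file is hypothetical, and unlike the [0] lookup above it fails loudly when the folder does not contain exactly one part file. The demo runs on a local scratch directory that stands in for the Spark output folder.

```python
import os
import shutil
import tempfile


def promote_single_part_file(temp_dir, final_path):
    """Copy the only part file in temp_dir to final_path, then delete temp_dir."""
    parts = sorted(n for n in os.listdir(temp_dir) if n.startswith("part-"))
    if len(parts) != 1:
        raise ValueError(
            f"expected exactly one part file in {temp_dir}, found {len(parts)}"
        )
    final_dir = os.path.dirname(final_path)
    if final_dir:
        os.makedirs(final_dir, exist_ok=True)
    shutil.copyfile(os.path.join(temp_dir, parts[0]), final_path)
    shutil.rmtree(temp_dir)  # removes the part file and the marker files
    return final_path


# Demo: fake a Spark output folder with one marker file and one part file
with tempfile.TemporaryDirectory() as scratch:
    temp_dir = os.path.join(scratch, "output")
    os.makedirs(temp_dir)
    with open(os.path.join(temp_dir, "_SUCCESS"), "w") as f:
        f.write("")
    with open(os.path.join(temp_dir, "part-00000-demo.csv"), "w") as f:
        f.write("id,value\n1,a\n")

    result = promote_single_part_file(
        temp_dir, os.path.join(scratch, "final", "final_output.csv")
    )
    with open(result) as f:
        content = f.read()
    temp_dir_removed = not os.path.exists(temp_dir)
```

Raising on zero or several part files catches the case where coalesce(1) was forgotten, which the list-index version would silently paper over.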

Remember, this technique might not be optimal for very large datasets due to the overhead of collecting data to a single partition. For large datasets, consider using appropriate partitioning strategies or handling multiple part files appropriately on the consuming side.
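As one way of handling multiple part files on the consuming side, the sketch below concatenates CSV part files into one file and keeps only the first header. It assumes the files are reachable through ordinary file paths, that each part was written with a single-line header, and that each part ends with a newline; merge_csv_parts is a hypothetical helper, not a Spark or Databricks API.

```python
import os
import shutil
import tempfile


def merge_csv_parts(part_dir, out_path):
    """Concatenate part-*.csv files into out_path, keeping a single header row."""
    parts = sorted(
        n for n in os.listdir(part_dir)
        if n.startswith("part-") and n.endswith(".csv")
    )
    header_written = False
    with open(out_path, "w", newline="") as out:
        for name in parts:
            with open(os.path.join(part_dir, name), newline="") as src:
                header = src.readline()
                if not header:
                    continue  # skip empty part files
                if not header_written:
                    out.write(header)
                    header_written = True
                shutil.copyfileobj(src, out)  # stream the remaining rows
    return len(parts)


# Demo: two small part files, each with its own header
with tempfile.TemporaryDirectory() as scratch:
    part_dir = os.path.join(scratch, "output")
    os.makedirs(part_dir)
    with open(os.path.join(part_dir, "part-00000-demo.csv"), "w") as f:
        f.write("id,value\n1,a\n")
    with open(os.path.join(part_dir, "part-00001-demo.csv"), "w") as f:
        f.write("id,value\n2,b\n")

    merged_path = os.path.join(scratch, "merged.csv")
    part_count = merge_csv_parts(part_dir, merged_path)
    with open(merged_path) as f:
        merged = f.read()
```

Because the rows are streamed file by file, this does not need the whole dataset in memory, but it still runs on a single machine, so it is a convenience for moderately sized outputs rather than a replacement for partitioned reads.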

Re: Writing back from notebook to blob storage as single file with UC configured databricks

Stefan-Koch — Tue, 18 Feb 2025 05:59:22 GMT

Hi Shivap

If you want to save a DataFrame as a single file, you could convert the PySpark DataFrame to a pandas DataFrame and then save it as a file. Note that toPandas() collects all rows to the driver, so this only suits data that fits in driver memory.

import os

path_single_file = '/Volumes/demo/raw/test/single'

# create sample dataframe
df = spark.createDataFrame(
    [(i, f'name_{i}', i*10, i%2 == 0, f'2025-02-{i+1:02d}') for i in range(1, 21)],
    ['id', 'name', 'value', 'is_even', 'date']
)

# convert df to pandas dataframe
pdf = df.toPandas()

# create folder, if not exists
if not os.path.exists(path_single_file):
    os.makedirs(path_single_file)

pdf.to_parquet(f'{path_single_file}/my_df.parquet', index=False)

When I then have a look into the directory, it looks like this: