06-28-2023 04:04 PM
So I've been trying to write a file to an S3 bucket with a custom name, but everything I try ends up with the file being dumped into a folder with the specified name, so the output looks like ".../file_name/part-001.parquet". Instead I want the file to show up as "/file_name.parquet".
07-10-2023 07:46 AM - edited 07-10-2023 07:48 AM
Hi @dsugs, thanks for posting here.
You need to use repartition(1) so Spark writes a single partition file to S3, then move that one file to your desired file name at the destination path.
You can use the below snippet:
# Write the DataFrame as a single partition. Spark still writes it into a
# directory named output_path containing one "part-*" file plus metadata files.
# Note: inferSchema is a read option and has no effect on write, so it is dropped here.
output_df.repartition(1).write.format(file_format).mode(write_mode).option("header", "true").save(output_path)

# Find the single "part-*" file Spark wrote inside the output_path directory.
fname = [y.name for y in dbutils.fs.ls(output_path) if y.name.startswith("part-")]

# Move that file out of the directory and give it the desired name.
dbutils.fs.mv(output_path + "/" + fname[0], f"{output_path}.parquet")

# Remove the now-empty directory. recurse=True is needed because it still
# holds Spark metadata files such as _SUCCESS.
dbutils.fs.rm(output_path, recurse=True)
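For reference, here is the same approach wrapped as a reusable helper. This is a minimal sketch: the helper name write_single_parquet and the path "s3://my-bucket/exports/report" are hypothetical placeholders, so adjust them to your environment.

# A minimal sketch of the approach above as a reusable helper (hypothetical names).
def write_single_parquet(df, output_path):
    # Write everything into one partition under a temporary directory.
    df.repartition(1).write.format("parquet").mode("overwrite").save(output_path)
    # Locate the lone part file and promote it to <output_path>.parquet.
    part = [f.name for f in dbutils.fs.ls(output_path) if f.name.startswith("part-")][0]
    dbutils.fs.mv(f"{output_path}/{part}", f"{output_path}.parquet")
    # Clean up the directory and its Spark metadata files.
    dbutils.fs.rm(output_path, recurse=True)

# Example usage with a hypothetical bucket path:
write_single_parquet(output_df, "s3://my-bucket/exports/report")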
07-10-2023 06:21 AM
@dsugs
This cannot be done directly; we only get to provide the directory name. A part file is one of potentially many files under that data directory, so if you named the first one file_name.parquet, you would have to name the second file_name2.parquet, and so on. It is usually suggested not to modify the file names under the data directory. But if you still insist on doing so, you can do a file-level copy using the dbutils.fs.cp() command and rename each file uniquely in a different location (see the sketch below).
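A minimal sketch of that copy-and-rename approach, assuming hypothetical source and destination paths and a hypothetical file_name prefix:

# Hypothetical paths; replace with your own locations.
src_dir = "s3://my-bucket/data/output"
dst_dir = "s3://my-bucket/data/renamed"

# Copy each part file to a new location under a unique, human-readable name.
part_files = [f for f in dbutils.fs.ls(src_dir) if f.name.startswith("part-")]
for i, f in enumerate(part_files, start=1):
    dbutils.fs.cp(f.path, f"{dst_dir}/file_name{i}.parquet")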
07-12-2023 03:06 AM
Hi @dsugs
Hope you are well. Just wanted to see if you were able to find an answer to your question, and if so, would you mind marking that answer as best? It would be really helpful for the other members too.
Cheers!
07-12-2023 02:08 PM
This is a Spark feature: to avoid network I/O, Spark writes each shuffle partition as its own 'part-...' file on disk, and each file, as you said, gets compression and efficient encoding by default.
So yes, it is directly related to parallel processing!
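A quick way to see this relationship yourself; this is a sketch, and the path "s3://my-bucket/tmp/demo" is a hypothetical example:

# Check how many partitions the DataFrame currently has.
print(output_df.rdd.getNumPartitions())  # e.g. 8

# Writing without repartitioning typically produces one "part-*" file per partition.
output_df.write.format("parquet").mode("overwrite").save("s3://my-bucket/tmp/demo")
parts = [f.name for f in dbutils.fs.ls("s3://my-bucket/tmp/demo") if f.name.startswith("part-")]
print(len(parts))  # typically matches getNumPartitions() above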