06-28-2023 04:04 PM
So I've been trying to write a file to an S3 bucket with a custom name, but everything I try just ends up with the file being dumped into a folder with the specified name, so the output looks like ".../file_name/part-001.parquet". Instead, I want the file to show up as "/file_name.parquet".
07-10-2023 07:46 AM - edited 07-10-2023 07:48 AM
Hi @dsugs, thanks for posting here.
You can use repartition(1) so that Spark writes a single part file to S3, then move that file to a destination path with the name you want.
You can use the snippet below:
# Write to a single partition so exactly one "part-*" file is produced.
# ("inferSchema" is a read option and "header" only applies to CSV, so both are dropped for parquet.)
output_df.repartition(1).write.format(file_format).mode(write_mode).save(output_path)

# Spark writes parquet output as a directory of "part-*" files; with repartition(1) there is exactly one.
fname = [f.name for f in dbutils.fs.ls(output_path) if f.name.startswith("part-")]

# Move that single file to the desired name, e.g. ".../file_name.parquet".
dbutils.fs.mv(output_path + "/" + fname[0], f"{output_path}.parquet")

# Delete the leftover output directory (the second argument, recurse=True, is required for directories).
dbutils.fs.rm(output_path, True)
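For reference, here is a minimal sketch wrapping those steps into a reusable helper; the function name and the example bucket path are hypothetical, not from the original post, and it assumes a Databricks notebook where dbutils is available:

# Minimal sketch; function name and example path are hypothetical.
def write_single_parquet(df, output_path: str) -> str:
    """Write df as exactly one parquet file named f"{output_path}.parquet"."""
    df.repartition(1).write.format("parquet").mode("overwrite").save(output_path)
    part = [f.name for f in dbutils.fs.ls(output_path) if f.name.startswith("part-")][0]
    dbutils.fs.mv(f"{output_path}/{part}", f"{output_path}.parquet")
    dbutils.fs.rm(output_path, True)  # recurse=True to remove the directory
    return f"{output_path}.parquet"

# Example usage (hypothetical path):
# write_single_parquet(output_df, "s3://my-bucket/exports/file_name")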
07-10-2023 06:21 AM
@dsugs
This cannot be done directly; we can only specify the directory name. A part file is one of potentially many files under that data directory, so if you named one file_name.parquet, you would have to name the second file_name2.parquet, and so on. It is usually recommended not to modify the file names under the data directory. But if you still want to do so, you can do a file-level copy using the dbutils.fs.cp() command and rename each file uniquely in a different location, as in the sketch below.
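As a rough sketch of that file-level copy approach (the paths and naming scheme here are illustrative, not from the original post):

# Sketch only: source_dir and dest_dir are hypothetical paths.
source_dir = "s3://my-bucket/data/output_dir"
dest_dir = "s3://my-bucket/renamed"

# Copy each part file out of the data directory, giving each a unique name:
# file_name1.parquet, file_name2.parquet, ...
part_files = [f for f in dbutils.fs.ls(source_dir) if f.name.startswith("part-")]
for i, f in enumerate(part_files, start=1):
    dbutils.fs.cp(f.path, f"{dest_dir}/file_name{i}.parquet")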
07-12-2023 03:06 AM
Hi @dsugs
Hope you are well. Just wanted to see if you were able to find an answer to your question and, if so, whether you would like to mark an answer as best. It would be really helpful for the other members too.
Cheers!
07-12-2023 02:08 PM
This is a Spark feature: to avoid network I/O, it writes each shuffle partition as a 'part-...' file on disk, and each file, as you said, gets compression and efficient encoding by default.
So yes, it is directly related to parallel processing!
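To make that concrete, here is a small illustrative sketch (the path is a placeholder) showing that the number of "part-*" files written matches the number of partitions at write time:

# Illustrative only: demo_path is a hypothetical location.
df = spark.range(1_000_000).repartition(8)
print(df.rdd.getNumPartitions())  # 8

demo_path = "s3://my-bucket/tmp/partition_demo"
df.write.mode("overwrite").parquet(demo_path)

parts = [f.name for f in dbutils.fs.ls(demo_path) if f.name.startswith("part-")]
print(len(parts))  # 8 part files, one per partition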