
How do I write dataframe to s3 without partition column name on the path

Jennifer
New Contributor III

I am currently trying to write a dataframe to S3 like this:

(df.write
    .partitionBy("col1", "col2")
    .mode("overwrite")
    .format("json")
    .save("s3a://my_bucket/"))

The path becomes `s3a://my_bucket/col1=abc/col2=opq/`

But I want the path to be `s3a://my_bucket/abc/opq/`

Is there a way to write the dataframe with a path removing the partition column name?

4 REPLIES

Kaniz_Fatma
Community Manager

Hi @Jennifer, when writing a DataFrame to S3 with partitioning, Spark automatically creates a directory structure based on the partition columns. Unfortunately, it's not possible to directly remove the column names from the partition path and set the path to just the values.

Here's why:

  1. Partitioning Mechanism:

    • Spark uses the directory structure to discover partitions. Each subdirectory corresponds to a specific partition value (see the read-back sketch after this list).
    • Including column names in the path ensures that Spark can correctly identify and read the data during query execution.
  2. Column Dropping:

    • When you use .partitionBy("col1", "col2"), Spark drops the specified columns from the DataFrame and organizes the data into subdirectories based on the unique values of these columns.
    • Removing the column names would break Spark's ability to save and read partitioned files.
  3. Workaround:

    • If you want a specific directory structure (e.g., s3a://my_bucket/abc/opq/), you can create a new column in your DataFrame that combines the values of col1 and col2.

    • Then, use this new column for partitioning:

      from pyspark.sql.functions import col, concat, lit

      df = df.withColumn("combined_col", concat(col("col1"), lit("/"), col("col2")))
      df.write.partitionBy("combined_col").mode("overwrite").format("json").save("s3a://my_bucket/")
      
    • This approach allows you to control the directory structure while still maintaining the necessary partition information.
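To illustrate the first point: because the directories follow the key=value convention, Spark's partition discovery restores col1 and col2 as columns when the data is read back. A minimal read-back sketch (paths illustrative):

    # col1 and col2 reappear as columns; their values come from the directory names
    df_back = spark.read.format("json").load("s3a://my_bucket/")
    df_back.select("col1", "col2").show()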

 

 

Jennifer
New Contributor III

Thanks for the quick reply. If I use the new column for partitioning, "combined_col" will still be in the path I think.

Jennifer
New Contributor III

What I did in the end was to write the files to DBFS first and then move them to S3, which gave me a customized path and file name. It also let me avoid writing commit files to S3.
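For anyone taking the same route, here is a rough sketch of that approach, assuming a Databricks notebook where dbutils is available; the staging path, target path, and output file name are illustrative, not the ones from the original job:

    # 1. Stage the output on DBFS first (coalesce(1) keeps it to a single data file).
    staging = "dbfs:/tmp/json_staging/"
    df.coalesce(1).write.mode("overwrite").format("json").save(staging)

    # 2. Copy only the data file (skipping _SUCCESS and other commit files)
    #    to a fully customized S3 path and file name.
    for f in dbutils.fs.ls(staging):
        if f.name.startswith("part-"):
            dbutils.fs.cp(f.path, "s3a://my_bucket/abc/opq/data.json")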

Sidhant07
New Contributor III

Hi @Jennifer,

The default behavior of the .partitionBy() function in Spark is to create a directory structure with the partition column names. This follows Hive's partitioning scheme and is done for optimization purposes, so you cannot directly change this behavior to remove the partition column names from the path.

However, you can achieve your desired directory structure with a workaround: after saving the dataframe, rename the directories in your S3 bucket to remove the partition column names. This has to be done outside of Spark, using the AWS SDK or CLI. Here is an example of how you can do it using the AWS CLI:

aws s3 mv s3://my_bucket/col1=abc s3://my_bucket/abc --recursive
aws s3 mv s3://my_bucket/abc/col2=opq s3://my_bucket/abc/opq --recursive

Please note that this operation can be time-consuming if you have a large number of files or directories to rename.
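For the AWS SDK route, a minimal boto3 sketch of the same rename might look like the following (bucket and prefixes are illustrative; S3 has no true rename, so each object is copied to the new key and the original is deleted):

    import boto3

    s3 = boto3.client("s3")
    bucket = "my_bucket"

    # Walk every object under the partition directory and rewrite its key
    # without the "col1=" / "col2=" parts.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix="col1=abc/"):
        for obj in page.get("Contents", []):
            old_key = obj["Key"]
            new_key = old_key.replace("col1=", "").replace("col2=", "")
            s3.copy_object(Bucket=bucket,
                           CopySource={"Bucket": bucket, "Key": old_key},
                           Key=new_key)
            s3.delete_object(Bucket=bucket, Key=old_key)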
