Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Spark Optimization

genevive_mdonça
Databricks Employee

Optimizing Shuffle Partition Size in Spark for Large Joins

I am working on a Spark join between two tables of sizes 300 GB and 5 GB, respectively. After analyzing the Spark UI, I noticed the following:
- The average shuffle write partition size for the larger table (300 GB) is around 800 MB.
- The average shuffle write partition size for the smaller table (5 GB) is just 1 MB.

I've learned that a shuffle write partition size of around 200 MB is ideal for my use case, but I'm not sure how to achieve this in Spark.

I've tried the following configurations:
1. `spark.conf.set("spark.sql.shuffle.partitions", 1000)` to set the number of shuffle partitions.
2. `spark.conf.set("spark.sql.adaptive.shuffle.targetPostShuffleInputSize", "150MB")` to adjust the post-shuffle input size.

Despite these changes, the partition sizes are still not as expected.
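
For context, here is roughly what I'm running (a minimal PySpark sketch with the settings above; the table names and join key are placeholders, not my actual schema):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Settings tried so far
spark.conf.set("spark.sql.shuffle.partitions", 1000)
spark.conf.set("spark.sql.adaptive.shuffle.targetPostShuffleInputSize", "150MB")

# Placeholder tables: the large one is ~300 GB, the small one is ~5 GB
large_df = spark.table("large_table")
small_df = spark.table("small_table")

# The shuffle join whose write partition sizes I want to bring to ~200 MB
joined = large_df.join(small_df, on="join_key", how="inner")
joined.write.mode("overwrite").saveAsTable("joined_result")
```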

How can I tune the shuffle partition size to around 200 MB in Spark, specifically for the larger table, to optimize join performance?

 

4 REPLIES

MuthuLakshmi
Databricks Employee

@genevive_mdonça
You have to calculate the correct number of shuffle partitions for your case, taking your cluster configuration into account.

Please follow this doc to calculate it: https://www.databricks.com/discover/pages/optimize-data-workloads-guide
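
As a rough illustration of that calculation (a sketch only; the cluster numbers below are assumptions, not your actual configuration, and `spark` is the active SparkSession):

```python
# Assumed cluster for illustration: 10 workers x 16 cores = 160 total cores
total_cores = 160

# Shuffle input for the large table and the target partition size
shuffle_input_mb = 300 * 1024        # ~300 GB written by the shuffle
target_partition_mb = 200

# Partitions needed to land at roughly 200 MB each
partitions_by_size = shuffle_input_mb // target_partition_mb   # ~1536

# One common heuristic: round up to a multiple of the core count
# so every core gets an even share of work (-> 1600 here)
num_partitions = ((partitions_by_size + total_cores - 1) // total_cores) * total_cores

spark.conf.set("spark.sql.shuffle.partitions", num_partitions)
```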

Thanks, will go through this.

Hi @genevive_mdonça,

You can use the following formula to calculate the optimal number of partitions based on the input data size and the target partition size:

Input stage data = 300 GB
Target partition size = 200 MB
Optimal number of partitions = 300,000 MB / 200 MB ≈ 1,500 partitions
spark.conf.set("spark.sql.shuffle.partitions", 1500)
Remember, the number of partitions should usually not be less than the number of cores in the cluster.

That said, Adaptive Query Execution (AQE) is enabled by default, and Spark can dynamically optimize partition sizes based on runtime statistics.
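
A minimal sketch of both approaches, assuming Spark 3.x / a current Databricks Runtime (note that `spark.sql.adaptive.shuffle.targetPostShuffleInputSize` from the question is the older Spark 2.x name; on Spark 3.x the advisory partition size setting below is the one AQE reads):

```python
# Option 1: static tuning, set the shuffle partition count explicitly
spark.conf.set("spark.sql.shuffle.partitions", 1500)

# Option 2: let AQE coalesce post-shuffle partitions toward a target size
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
# Advisory size AQE aims for when coalescing shuffle partitions
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "200MB")
```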

Lakshay
Databricks Employee

Have you tried using `spark.sql.files.maxPartitionBytes=209715200` (200 MB)?
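
For reference, a quick sketch of that setting (keep in mind that `spark.sql.files.maxPartitionBytes` controls the maximum size of partitions created when reading files, so it shapes the scan side rather than the shuffle write partitions):

```python
# 200 MB expressed in bytes: 200 * 1024 * 1024
spark.conf.set("spark.sql.files.maxPartitionBytes", 209715200)

# This caps the size of partitions produced when scanning the input files;
# shuffle partition sizes are still governed by spark.sql.shuffle.partitions
# and the AQE coalescing settings discussed above.
```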
