Spark Optimization
11-17-2024 11:04 PM
Optimizing Shuffle Partition Size in Spark for Large Joins
I am working on a Spark join between two tables of sizes 300 GB and 5 GB, respectively. After analyzing the Spark UI, I noticed the following:
- The average shuffle write partition size for the larger table (300 GB) is around 800 MB.
- The average shuffle write partition size for the smaller table (5 GB) is just 1 MB.
I've learned that a shuffle write partition size of around 200 MB is ideal for my use case, but I'm not sure how to achieve this in Spark.
I've tried the following configurations:
1. `spark.conf.set("spark.sql.shuffle.partitions", 1000)` - to set the number of shuffle partitions.
2. `spark.conf.set("spark.sql.adaptive.shuffle.targetPostShuffleInputSize", "150MB")` - to adjust the post-shuffle input size.
Despite these changes, the partition sizes are still not as expected.
How can I tune the shuffle partition size to around 200 MB in Spark, specifically for the larger table, to optimize join performance?
11-18-2024 02:49 AM
@genevive_mdonça
You have to calculate the correct number of shuffle partitions for your case, taking your cluster configuration into account.
Please follow this doc to calculate it: https://www.databricks.com/discover/pages/optimize-data-workloads-guide
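For example, the calculation usually amounts to dividing the shuffle input size by the target partition size and keeping at least as many partitions as you have cores (the cluster figures below are assumptions, not taken from your post):

```python
# Rough sizing sketch - replace these with your actual numbers.
shuffle_input_mb = 300 * 1024       # ~300 GB of shuffle input for the large table
target_partition_mb = 200           # desired shuffle partition size
total_executor_cores = 64           # assumed total cores across all executors

# Enough partitions to hit the target size, but never fewer than the core count
num_partitions = max(shuffle_input_mb // target_partition_mb, total_executor_cores)

spark.conf.set("spark.sql.shuffle.partitions", str(num_partitions))
```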
11-18-2024 09:41 PM
Thanks, will go through this.
11-19-2024 12:54 AM
Hi @genevive_mdonça,
You can use the following formula to calculate the optimal number of partitions based on the size of the input data and the target partition size:
Input stage data = 300 GB
Target partition size = 200 MB
Optimal number of partitions = 300,000 MB / 200 MB = 1,500 partitions
`spark.conf.set("spark.sql.shuffle.partitions", 1500)`
Remember, the number of partitions should usually not be less than the number of cores in your cluster.
Also, Adaptive Query Execution (AQE) should be enabled by default, and Spark can then dynamically optimize the partition size based on runtime statistics.
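As a sketch of how those AQE settings fit together (Spark 3.x configuration names; the 200 MB target comes from the question, the rest is illustrative):

```python
# Sketch of AQE-based tuning toward a ~200 MB post-shuffle partition size (Spark 3.x config names).
spark.conf.set("spark.sql.adaptive.enabled", "true")                        # AQE is on by default in recent Spark
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")     # let AQE merge small shuffle partitions
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "200MB")  # advisory target size after the shuffle

# With coalescing enabled, the initial shuffle partition count still comes from
# spark.sql.shuffle.partitions (unless overridden), and AQE only coalesces it downward,
# so the 1500 calculated above remains a sensible starting point.
spark.conf.set("spark.sql.shuffle.partitions", 1500)
```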
11-18-2024 09:05 AM
Have you tried using `spark.sql.files.maxPartitionBytes=209715200` (200 MB)?
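Worth noting that `spark.sql.files.maxPartitionBytes` controls how the input files are split when they are first read, rather than the shuffle partitions produced by the join itself. A minimal sketch of applying it (table and column names below are made up for illustration):

```python
# Cap read (scan) partitions at ~200 MB; this affects file splitting on read,
# not the size of the shuffle partitions created by the join.
spark.conf.set("spark.sql.files.maxPartitionBytes", str(200 * 1024 * 1024))  # 209715200 bytes

# Hypothetical table and column names, for illustration only.
large_df = spark.read.table("large_table_300gb")
small_df = spark.read.table("small_table_5gb")
joined = large_df.join(small_df, on="join_key", how="inner")
```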

