Optimizing Shuffle Partition Size in Spark for Large Joins
I am working on a Spark join between two tables of sizes 300 GB and 5 GB, respectively. After analyzing the Spark UI, I noticed the following:
- The average shuffle write partition size for the larger table (300 GB) is around 800 MB.
- The average shuffle write partition size for the smaller table (5 GB) is just 1 MB.
I've read that a shuffle write partition size of around 200 MB is a good target for my use case, but I'm not sure how to achieve this in Spark.
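For the 300 GB table, hitting that target would mean something on the order of 300 GB / 200 MB ≈ 1,500 shuffle partitions, assuming the shuffle data were spread evenly across partitions.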
I've tried the following configurations:
1. `spark.conf.set("spark.sql.shuffle.partitions", 1000)` to set the number of shuffle partitions.
2. `spark.conf.set("spark.sql.adaptive.shuffle.targetPostShuffleInputSize", "150MB")` to adjust the post-shuffle input size.
Despite these changes, the partition sizes are still not as expected.
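For context, here is a minimal, anonymized sketch of the job I'm running. The app name, table paths, join key, and output location are placeholders rather than my actual names; the structure and the configurations are the same as in my real job.

```python
from pyspark.sql import SparkSession

# Placeholder app name; the real job runs on a YARN cluster.
spark = (
    SparkSession.builder
    .appName("large-join-tuning")
    .getOrCreate()
)

# Configurations I have tried so far
spark.conf.set("spark.sql.shuffle.partitions", 1000)
spark.conf.set("spark.sql.adaptive.shuffle.targetPostShuffleInputSize", "150MB")

# Placeholder input paths: the large table is ~300 GB, the small one ~5 GB.
large_df = spark.read.parquet("/data/large_table")
small_df = spark.read.parquet("/data/small_table")

# Join on a placeholder key and write the result out.
joined = large_df.join(small_df, on="join_key", how="inner")
joined.write.mode("overwrite").parquet("/data/joined_output")
```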
How can I tune the shuffle partition size to around 200 MB in Spark, specifically for the larger table, to optimize join performance?