Spark Optimization
11-17-2024 11:04 PM
Optimizing Shuffle Partition Size in Spark for Large Joins
I am working on a Spark join between two tables of sizes 300 GB and 5 GB, respectively. After analyzing the Spark UI, I noticed the following:
- The average shuffle write partition size for the larger table (300 GB) is around 800 MB.
- The average shuffle write partition size for the smaller table (5 GB) is just 1 MB.
I've learned that a shuffle write partition size of around 200 MB is ideal for my use case, but I'm not sure how to achieve this in Spark.
I've tried the following configurations:
1. `spark.conf.set("spark.sql.shuffle.partitions", 1000)` - to set the number of shuffle partitions.
2. `spark.conf.set("spark.sql.adaptive.shuffle.targetPostShuffleInputSize", "150MB")` - to adjust the post-shuffle input size.
Despite these changes, the partition sizes are still not as expected.
How can I tune the shuffle partition size to around 200 MB in Spark, specifically for the larger table, to optimize join performance?
11-18-2024 02:49 AM
@genevive_mdonça
You have to calculate the correct number of shuffle partitions for your case, taking your cluster configuration into account.
Please follow this doc to calculate it: https://www.databricks.com/discover/pages/optimize-data-workloads-guide
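For example, the calculation usually amounts to dividing the shuffle input size by the target partition size and keeping at least as many partitions as you have cores (the cluster figures below are assumptions, not taken from your post):

```python
# Rough sizing sketch - replace these with your actual numbers.
shuffle_input_mb = 300 * 1024       # ~300 GB of shuffle input for the large table
target_partition_mb = 200           # desired shuffle partition size
total_executor_cores = 64           # assumed total cores across all executors

# Enough partitions to hit the target size, but never fewer than the core count
num_partitions = max(shuffle_input_mb // target_partition_mb, total_executor_cores)

spark.conf.set("spark.sql.shuffle.partitions", str(num_partitions))
```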
11-18-2024 09:41 PM
Thanks, will go through this.
11-19-2024 12:54 AM
Hi @genevive_mdonça,
You can use the following formula to calculate the optimal number of partitions based on the size of the input data and the target partition size:
Input stage data = 300 GB
Target partition size = 200 MB
Optimal number of partitions = 300,000 MB / 200 MB = 1,500 partitions
`spark.conf.set("spark.sql.shuffle.partitions", 1500)`
Remember, the number of partitions should usually not be less than the number of cores in your cluster.
Also, Adaptive Query Execution (AQE) should be enabled by default, and Spark can then dynamically optimize the partition size based on runtime statistics.
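As a sketch of how those AQE settings fit together (Spark 3.x configuration names; the 200 MB target comes from the question, the rest is illustrative):

```python
# Sketch of AQE-based tuning toward a ~200 MB post-shuffle partition size (Spark 3.x config names).
spark.conf.set("spark.sql.adaptive.enabled", "true")                        # AQE is on by default in recent Spark
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")     # let AQE merge small shuffle partitions
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "200MB")  # advisory target size after the shuffle

# With coalescing enabled, the initial shuffle partition count still comes from
# spark.sql.shuffle.partitions (unless overridden), and AQE only coalesces it downward,
# so the 1500 calculated above remains a sensible starting point.
spark.conf.set("spark.sql.shuffle.partitions", 1500)
```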
11-18-2024 09:05 AM
Have you tried using `spark.sql.files.maxPartitionBytes=209715200` (200 MB)?
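Worth noting that `spark.sql.files.maxPartitionBytes` controls how the input files are split when they are first read, rather than the shuffle partitions produced by the join itself. A minimal sketch of applying it (table and column names below are made up for illustration):

```python
# Cap read (scan) partitions at ~200 MB; this affects file splitting on read,
# not the size of the shuffle partitions created by the join.
spark.conf.set("spark.sql.files.maxPartitionBytes", str(200 * 1024 * 1024))  # 209715200 bytes

# Hypothetical table and column names, for illustration only.
large_df = spark.read.table("large_table_300gb")
small_df = spark.read.table("small_table_5gb")
joined = large_df.join(small_df, on="join_key", how="inner")
```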

