Spark Optimization

genevive_mdonça — Mon, 18 Nov 2024 07:04:01 GMT

Optimizing Shuffle Partition Size in Spark for Large Joins

I am working on a Spark join between two tables of sizes 300 GB and 5 GB, respectively. After analyzing the Spark UI, I noticed the following:
- The average shuffle write partition size for the larger table (300 GB) is around 800 MB.
- The average shuffle write partition size for the smaller table (5 GB) is just 1 MB.

I've read that a shuffle write partition size of around 200 MB is ideal for my use case, but I'm not sure how to achieve this in Spark.

I've tried the following configurations:
1. `spark.conf.set("spark.sql.shuffle.partitions", 1000)` — to set the number of shuffle partitions.
2. `spark.conf.set("spark.sql.adaptive.shuffle.targetPostShuffleInputSize", "150MB")` — to adjust post-shuffle input size.

Despite these changes, the partition sizes are still not as expected.

How can I tune the shuffle partition size to around 200 MB in Spark, specifically for the larger table, to optimize join performance?

Re: Spark Optimization

MuthuLakshmi — Mon, 18 Nov 2024 10:49:03 GMT

@genevive_mdonça
You have to calculate the correct number of shuffle partitions for your case, taking the cluster configuration into account.

Please follow this doc to calculate it: https://www.databricks.com/discover/pages/optimize-data-workloads-guide

Re: Spark Optimization

Lakshay — Mon, 18 Nov 2024 17:05:05 GMT

Have you tried setting `spark.sql.files.maxPartitionBytes=209715200` (200 MB expressed in bytes)?
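
For reference, the value in this suggestion is 200 MB written out in bytes. A minimal sketch of the arithmetic follows; the helper name `mb_to_bytes` is hypothetical, and the commented-out line assumes an existing SparkSession named `spark`. As far as the Spark documentation describes it, this setting caps how many bytes are packed into one partition when reading files, so it is likely to shape the scan stage rather than the shuffle write size directly.

```python
def mb_to_bytes(megabytes: int) -> int:
    """Convert megabytes (binary, 1 MB = 1024 * 1024 bytes) to bytes."""
    return megabytes * 1024 * 1024


# 200 MB expressed in bytes, matching the value in the suggestion above.
max_partition_bytes = mb_to_bytes(200)
print(max_partition_bytes)  # 209715200

# In a live session (assumption: `spark` is an existing SparkSession):
# spark.conf.set("spark.sql.files.maxPartitionBytes", max_partition_bytes)
```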

Re: Spark Optimization

genevive_mdonça — Tue, 19 Nov 2024 05:41:28 GMT

Thanks, I will go through this.

Re: Spark Optimization

szymon_dybczak — Tue, 19 Nov 2024 08:54:35 GMT

Hi @genevive_mdonça,

You can use the following formula to calculate the optimal number of partitions from the size of the input data and the target partition size:

Input stage data = 300 GB (about 300,000 MB)
Target partition size = 200 MB
Optimal number of partitions = 300,000 MB / 200 MB = 1,500 partitions
`spark.conf.set("spark.sql.shuffle.partitions", 1500)`

Remember that the number of partitions should usually not be less than the number of cores.
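
The calculation above can be sketched as a small helper. This is only an illustration: the function name `shuffle_partition_count` and the 64-core cluster size are assumptions, not values from this thread, and the commented-out line assumes an existing SparkSession named `spark`.

```python
import math


def shuffle_partition_count(input_mb: float, target_mb: float, total_cores: int) -> int:
    """Return a partition count so each partition holds roughly target_mb.

    The result is never lower than total_cores, following the rule of thumb
    that partitions should not be fewer than the available cores.
    """
    by_size = math.ceil(input_mb / target_mb)
    return max(by_size, total_cores)


# 300 GB of shuffle input, 200 MB target, hypothetical 64-core cluster.
num_partitions = shuffle_partition_count(300_000, 200, total_cores=64)
print(num_partitions)  # 1500

# In a live session (assumption: `spark` is an existing SparkSession):
# spark.conf.set("spark.sql.shuffle.partitions", num_partitions)
```

Rounding up with `math.ceil` keeps each partition at or below the target when the input size is not an exact multiple of it.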

That said, Adaptive Query Execution (AQE) should be enabled by default, in which case Spark can dynamically optimize partition sizes based on runtime statistics.
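
If you want to steer AQE toward a specific partition size, a sketch along these lines may help. It assumes Spark 3.x, where `spark.sql.adaptive.advisoryPartitionSizeInBytes` replaced the older `spark.sql.adaptive.shuffle.targetPostShuffleInputSize` key. The helper names `aqe_settings` and `apply_settings` are hypothetical, and the commented-out call assumes an existing SparkSession named `spark`.

```python
def aqe_settings(target_mb: int = 200) -> dict:
    """Build Spark 3.x AQE settings that ask for post-shuffle partitions near target_mb."""
    return {
        # Turn on Adaptive Query Execution.
        "spark.sql.adaptive.enabled": "true",
        # Let AQE merge small post-shuffle partitions together.
        "spark.sql.adaptive.coalescePartitions.enabled": "true",
        # Advisory size, in bytes, that AQE aims for when coalescing.
        "spark.sql.adaptive.advisoryPartitionSizeInBytes": str(target_mb * 1024 * 1024),
    }


def apply_settings(spark, settings: dict) -> None:
    """Apply each key/value pair to the session's runtime configuration."""
    for key, value in settings.items():
        spark.conf.set(key, value)


settings = aqe_settings(200)
print(settings["spark.sql.adaptive.advisoryPartitionSizeInBytes"])  # 209715200

# In a live session (assumption: `spark` is an existing SparkSession):
# apply_settings(spark, settings)
```

Note that coalescing merges small partitions rather than splitting large ones, so `spark.sql.shuffle.partitions` still needs to be high enough that the 300 GB side starts at or below the target size.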