11-26-2022 03:11 PM
I am learning how to optimize Spark applications with experiments from Spark UI Simulator. There is experiment #1596 about data skew and in command 2 there is comment about how many partitions will be set as default:
// Factor of 8 cores and greater than the expected 825 partitions
What I don't understand is the second part of this sentence: "expected 825 partitions". What is the source of this number? How was it calculated?
11-27-2022 10:20 AM
Hi @Bartosz Maciejewski
Generally we arrive at the number of shuffle partitions using the following method.
Input Size Data - 100 GB
Ideal partition target size - 128 MB
Cores - 8
Ideal number of partitions = (100 * 1024) / 128 = 800
To utilize the available cores properly, especially in the last wave of tasks, the number of shuffle partitions should be a multiple of the core count; otherwise some cores sit idle. Too few partitions lead to low concurrency, and too many lead to excessive shuffle overhead.
As for the example you are referring to, if you calculate the ideal number of partitions using the actual input data size and a desired target size (64 MB, 128 MB, or anything below 500 MB), it should come to around 825.
The multiples of 8 nearest to 825 are 824 and 832. With 824, the last wave of tasks would run the 825th partition alone while 7 of the 8 cores sat idle. So we go with the next multiple, 832, which makes optimal use of all the available cores.
Hope this helps. Do comment if you have any queries.
Cheers.
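The sizing steps above can be sketched in Python (the function name and defaults are illustrative, not part of any Spark API): compute the size-based ideal partition count, then round up to the next multiple of the core count so the final wave of tasks keeps every core busy.

```python
import math

def shuffle_partitions(input_size_gb, target_mb=128, cores=8):
    """Estimate shuffle partitions: the size-based ideal count,
    rounded up to the next multiple of the core count so no core
    sits idle in the final wave of tasks."""
    ideal = math.ceil(input_size_gb * 1024 / target_mb)
    return math.ceil(ideal / cores) * cores

print(shuffle_partitions(100))    # 100 GB -> ideal 800, already a multiple of 8
print(shuffle_partitions(103.1))  # 103.1 GB -> ideal 825, rounded up to 832
```

Rounding up (rather than down to 824) is the key step: it trades a slightly smaller partition size for full core utilization in the last wave.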
11-26-2022 10:22 PM
Hi @Bartosz Maciejewski
Great to meet you, and thanks for your question!
Let's see if your peers in the community have an answer to your question first. Or else bricksters will get back to you soon.
Thanks
11-28-2022 04:51 AM
Hi @Uma Maheswara Rao Desula
thank you for the response!
I have checked the real size of the loaded dataset and it was a bit more than 103 GB. Using your formula:
(103.1 * 1024) / 128 = 824.8
This is why the exercise expected 825 partitions, and then, as you pointed out, the closest multiple of 8 is 832.
Now everything is clear to me - I like to know exactly why each metric was selected 🙂
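For reference, once the value is worked out it can be applied before the wide transformation runs. A minimal sketch (assumes an existing SparkSession; `spark.sql.shuffle.partitions` is the standard Spark SQL config key):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# 832 = 825 ideal partitions rounded up to the next multiple of 8 cores
spark.conf.set("spark.sql.shuffle.partitions", 832)
```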
Cheers
Bartek