
Number of partitions in Spark UI Simulator experiment

Bartek
Contributor

I am learning how to optimize Spark applications with experiments from the Spark UI Simulator. In experiment #1596 about data skew, command 2 contains a comment about how many partitions will be set as default:

// Factor of 8 cores and greater than the expected 825 partitions
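This comment sits on the line that sets the shuffle partition count, which looks roughly like this (832 being the first multiple of 8 above 825):

// Command 2, roughly: set the post-shuffle partition count to a multiple
// of the 8 cores that is greater than the expected 825 partitions.
spark.conf.set("spark.sql.shuffle.partitions", 832)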

What I don't understand is the second part of this sentence: "expected 825 partitions". What is the source of this number? How was it calculated?



3 REPLIES

Anonymous
Not applicable

Hi @Bartosz Maciejewski​ 

Great to meet you, and thanks for your question! 

Let's see if your peers in the community have an answer to your question first. Otherwise, Bricksters will get back to you soon.

Thanks

UmaMahesh1
Honored Contributor III

Hi @Bartosz Maciejewski​ 

Generally we arrive at the number of shuffle partitions using the following method.

Input data size - 100 GB

Ideal partition target size - 128 MB

Cores - 8

Ideal number of partitions = (100 * 1024) / 128 = 800
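A minimal sketch of that arithmetic in Scala, using the example figures above (100 GB input, 128 MB target):

// Size-based estimate: how many partitions does the input need
// in order to hit the target partition size?
val inputSizeMB  = 100 * 1024                 // 100 GB expressed in MB
val targetSizeMB = 128                        // desired partition size in MB
val idealPartitions =
  math.ceil(inputSizeMB.toDouble / targetSizeMB).toInt   // 800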

To utilize the available cores properly, especially in the last scheduling wave, the number of shuffle partitions should be a multiple of the core count; otherwise some cores sit idle. Too few partitions leads to less concurrency, and too many leads to lots of small shuffle tasks and files.

As for the example you are referring to, if you calculate the ideal number of partitions with the actual input data size and a 128 MB target (any target size below roughly 500 MB is reasonable), it comes out to around 825.

The multiples of 8 nearest to 825 are 824 and 832. With exactly 825 partitions, the first 824 tasks fill 103 complete waves and the 825th partition runs alone in a final wave where 7 of the 8 cores sit idle; rounding down to 824 would push each partition slightly above the size target. So we go with the next multiple up, 832, which keeps all the available cores busy.
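A sketch of that rounding step, assuming the ~825 estimate from the experiment and a notebook or spark-shell where spark is already defined:

// Round the size-based estimate up to the next multiple of the core count
// so the final scheduling wave still keeps every core busy.
val cores           = 8
val idealPartitions = 825                                                // size-based estimate
val waves           = math.ceil(idealPartitions.toDouble / cores).toInt  // 104 full waves
val shufflePartitions = waves * cores                                    // 832

// spark.sql.shuffle.partitions controls the post-shuffle partition count.
spark.conf.set("spark.sql.shuffle.partitions", shufflePartitions)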

Hope this helps. Do comment if you have any queries.

Cheers.

Hi @Uma Maheswara Rao Desula​ 

thank you for the response!

I have checked the real size of the loaded dataset and it was a little more than 103 GB. Using your formula:

(103.1 * 1024) / 128 = 824.8

This is why the exercise expects 825 partitions, and then, as you pointed out, the next multiple of 8 is 832.
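For anyone following along, the same check in a spark-shell (103.1 GB is the size I measured, so treat it as approximate):

// Verify the partition math from this thread.
val measuredGB = 103.1                                         // measured input size in GB
val targetMB   = 128.0                                         // target partition size in MB
val ideal      = math.ceil(measuredGB * 1024 / targetMB).toInt // 825
val roundedTo8 = (math.ceil(ideal.toDouble / 8) * 8).toInt     // 832, next multiple of 8 cores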

Now everything is clear to me - I like to know exactly why each value was selected 🙂

Cheers

Bartek
