Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Number of partitions in Spark UI Simulator experiment

Bartek
Contributor

I am learning how to optimize Spark applications using experiments from the Spark UI Simulator. Experiment #1596 covers data skew, and command 2 contains a comment about how many partitions will be set by default:

// Factor of 8 cores and greater than the expected 825 partitions

What I don't understand is the second part of this sentence: "expected 825 partitions". What is the source of this number? How was it calculated?


1 ACCEPTED SOLUTION

Accepted Solutions

UmaMahesh1
Honored Contributor III

Hi @Bartosz Maciejewski

Generally we arrive at the number of shuffle partitions as follows.

Input data size - 100 GB

Ideal partition target size - 128 MB

Cores - 8

Ideal number of partitions = (100 * 1024) / 128 = 800

To utilize the available cores properly, especially in the last wave of tasks, the number of shuffle partitions should be a multiple of the core count; otherwise some cores sit idle. Too few partitions reduce concurrency, while too many produce lots of tiny shuffle tasks and scheduling overhead.

As for the example you are referring to: if you calculate the ideal number of partitions from the actual input data size and the desired target size (64 MB, 128 MB, or anything below 500 MB), it comes out to around 825.

Now, 825 is not a multiple of 8. With 825 partitions you would run 103 full waves of 8 tasks and then a final wave with a single task, leaving 7 of the 8 cores idle. So we round up to the next multiple of 8, which is 832, keeping all available cores busy in every wave.
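The sizing rule above can be sketched as a small Python helper. The function name and the 128 MB default are illustrative, not a Spark API; it just encodes "size divided by target, rounded up to the next multiple of the core count":

```python
import math

def ideal_shuffle_partitions(input_size_gb, target_partition_mb=128, cores=8):
    """Estimate a shuffle-partition count: input size divided by the
    target partition size, rounded up to the next multiple of the core
    count so the last wave of tasks keeps every core busy."""
    raw = math.ceil(input_size_gb * 1024 / target_partition_mb)
    return math.ceil(raw / cores) * cores

# 100 GB at a 128 MB target on 8 cores: 800 partitions (already a multiple of 8)
print(ideal_shuffle_partitions(100))   # 800

# 825 raw partitions round up to the next multiple of 8
print(math.ceil(825 / 8) * 8)          # 832
```

The resulting number would then be applied via the `spark.sql.shuffle.partitions` configuration before the shuffle runs.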

Hope this helps. Do comment if you have any queries.

Cheers.

Uma Mahesh D


3 REPLIES

Anonymous
Not applicable

Hi @Bartosz Maciejewski

Great to meet you, and thanks for your question! 

Let's see if your peers in the community have an answer to your question first. Otherwise, Bricksters will get back to you soon.

Thanks


Hi @Uma Maheswara Rao Desula

thank you for the response!

I checked the real size of the loaded dataset and it was a little over 103 GB. Using your formula:

(103.1 * 1024) / 128 = 824.8

This is why the exercise ended up with 825 partitions, and then, as you pointed out, the next multiple of 8 is 832.
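To double-check that arithmetic, a minimal Python sketch using the 103.1 GB figure reported above (the variable names are mine):

```python
import math

size_gb, target_mb, cores = 103.1, 128, 8

raw = size_gb * 1024 / target_mb                        # ~824.8 target-size chunks
partitions = math.ceil(raw)                             # 825
shuffle_partitions = math.ceil(partitions / cores) * cores  # next multiple of 8
print(partitions, shuffle_partitions)                   # 825 832
```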

Now everything is clear to me - I like to know exactly why each value was chosen 🙂

Cheers

Bartek
