
Number of partitions in Spark UI Simulator experiment

Bartek
Contributor

I am learning how to optimize Spark applications with experiments from the Spark UI Simulator. In experiment #1596 about data skew, command 2 contains a comment about how many partitions will be set as default:

// Factor of 8 cores and greater than the expected 825 partitions
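This comment sits on the line that sets the shuffle partition count, which looks roughly like this (832 being the first multiple of 8 above 825):

// Command 2, roughly: set the post-shuffle partition count to a multiple
// of the 8 cores that is greater than the expected 825 partitions.
spark.conf.set("spark.sql.shuffle.partitions", 832)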

What I don't understand is the second part of this sentence: "expected 825 partitions". What is the source of this number? How was it calculated?



3 REPLIES

Anonymous
Not applicable

Hi @Bartosz Maciejewski​ 

Great to meet you, and thanks for your question! 

Let's see if your peers in the community have an answer to your question first. Otherwise, Bricksters will get back to you soon.

Thanks

UmaMahesh1
Honored Contributor III

Hi @Bartosz Maciejewski​ 

Generally we arrive at the number of shuffle partitions using the following method.

Input data size - 100 GB

Ideal partition target size - 128 MB

Cores - 8

Ideal number of partitions = (100 * 1024) / 128 = 800
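A minimal sketch of that arithmetic in Scala, using the example figures above (100 GB input, 128 MB target):

// Size-based estimate: how many partitions does the input need
// in order to hit the target partition size?
val inputSizeMB  = 100 * 1024                 // 100 GB expressed in MB
val targetSizeMB = 128                        // desired partition size in MB
val idealPartitions =
  math.ceil(inputSizeMB.toDouble / targetSizeMB).toInt   // 800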

To utilize the available cores properly, especially in the last scheduling wave, the number of shuffle partitions should be a multiple of the core count; otherwise some cores sit idle. Too few partitions leads to less concurrency, and too many leads to lots of small shuffle tasks and files.

As for the example you are referring to, if you calculate the ideal number of partitions with the actual input data size and a 128 MB target (any target size below roughly 500 MB is reasonable), it comes out to around 825.

The multiples of 8 nearest to 825 are 824 and 832. With exactly 825 partitions, the first 824 tasks fill 103 complete waves and the 825th partition runs alone in a final wave where 7 of the 8 cores sit idle; rounding down to 824 would push each partition slightly above the size target. So we go with the next multiple up, 832, which keeps all the available cores busy.
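A sketch of that rounding step, assuming the ~825 estimate from the experiment and a notebook or spark-shell where spark is already defined:

// Round the size-based estimate up to the next multiple of the core count
// so the final scheduling wave still keeps every core busy.
val cores           = 8
val idealPartitions = 825                                                // size-based estimate
val waves           = math.ceil(idealPartitions.toDouble / cores).toInt  // 104 full waves
val shufflePartitions = waves * cores                                    // 832

// spark.sql.shuffle.partitions controls the post-shuffle partition count.
spark.conf.set("spark.sql.shuffle.partitions", shufflePartitions)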

Hope this helps. Do comment if you have any queries.

Cheers.

Hi @Uma Maheswara Rao Desula​ 

thank you for the response!

I have checked the real size of the loaded dataset and it was a little more than 103 GB. Using your formula:

(103.1 * 1024) / 128 = 824.8

This is why the exercise expects 825 partitions, and then, as you pointed out, the next multiple of 8 is 832.
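For anyone following along, the same check in a spark-shell (103.1 GB is the size I measured, so treat it as approximate):

// Verify the partition math from this thread.
val measuredGB = 103.1                                         // measured input size in GB
val targetMB   = 128.0                                         // target partition size in MB
val ideal      = math.ceil(measuredGB * 1024 / targetMB).toInt // 825
val roundedTo8 = (math.ceil(ideal.toDouble / 8) * 8).toInt     // 832, next multiple of 8 cores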

Now everything is clear to me - I like to know exactly why each value was selected 🙂

Cheers

Bartek
