Spark data limits

BorislavBlagoev
Valued Contributor III

How much data is too much for Spark, and what is the best strategy for partitioning 2GB of data?

1 ACCEPTED SOLUTION

Hubert-Dudek
Esteemed Contributor III

2GB is quite small, so the default settings are usually best: in most cases you get a better result by not setting anything (no repartition etc.) and leaving everything to the Catalyst optimizer. If you do want to set custom partitioning, keep the following in mind (a short code sketch follows this list):

  • Remember to avoid data skew (one partition being much bigger than the others).
  • Think about avoiding data shuffles: custom partitioning can sometimes help avoid a shuffle, but it can just as often add one.
  • One core/thread works on one partition, so it makes sense to have the number of partitions equal to the number of CPUs, but Spark already sets it like that automatically.
  • When there are many data shuffles you can end up with a few hundred partitions, as the data has to be shuffled and go through memory and the network.
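
For concreteness, here is a minimal PySpark sketch of the above. The DataFrame, the path, and the column name are hypothetical; the point is only to show how to check what Spark chose, how to spot skew, and what explicit repartitioning looks like if you really need it:

```
from pyspark.sql import functions as F

# Hypothetical ~2GB DataFrame; the path is a placeholder.
df = spark.read.parquet("/mnt/example/2gb-dataset")

# How many partitions did Spark pick on its own?
print(df.rdd.getNumPartitions())

# Per-partition row counts: a quick way to spot skew
# (one partition much bigger than the others).
(df.groupBy(F.spark_partition_id().alias("partition"))
   .count()
   .orderBy("count", ascending=False)
   .show())

# Only if the defaults really do not fit - both of these trigger a shuffle.
evenly = df.repartition(8)               # fixed number of partitions
by_key = df.repartition("customer_id")   # hash-partition by a column (assumed name)
```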

There are many videos about this on the Databricks YouTube channel (some quite long).

In my opinion data skew is the most important issue and the easiest to analyze, and it can also hurt you when everything is managed automatically (you can see in the Spark UI, for example, that one partition was processed in 1 second and another in 1 minute). It typically happens when the data we read is already badly partitioned. A common example is a JDBC read where lowerBound and upperBound are set to split the ids into buckets of, say, 1000 ids per partition; over time the ids keep growing, so every partition still gets 1000 rows, but the last one, which takes all rows above upperBound, can end up with millions...
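
The JDBC case is easy to illustrate. Here is a hedged sketch of the read pattern that produces this skew; the connection URL, table, and column names are made up:

```
# Spark splits the range [lowerBound, upperBound] on the partition column into
# numPartitions strides; rows outside the bounds still get read, they just all
# land in the first or last partition.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/shop")  # placeholder URL
      .option("dbtable", "orders")                          # placeholder table
      .option("partitionColumn", "id")
      .option("lowerBound", 0)
      .option("upperBound", 100_000)   # stale bound: the table has grown far past this
      .option("numPartitions", 100)    # ~1000 ids per partition, except the last one
      .load())

# Refreshing the bounds from the source (e.g. SELECT min(id), max(id) FROM orders)
# before the read keeps the partitions balanced.
```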

8 REPLIES

@Hubert Dudek So Spark automatically partitions the data, and that depends on the CPUs. So, for example, if I have a cluster with 4 nodes and 2 cores per node, does that mean Spark will make 8 partitions for my DataFrame?

@Hubert Dudek Can you explain this one in more detail:

  • When there are many data shuffles you can end up with a few hundred partitions, as the data has to be shuffled and go through memory and the network.

Hubert-Dudek
Esteemed Contributor III

"DataFrame increases the partition number to 200 automatically when Spark operation performs data shuffling (join(), union(), aggregation functions). ."

it is from https://sparkbyexamples.com/spark/spark-shuffle-partitions/

4 workers with 2 cores in many cases will give 8 partitions (but not always ๐Ÿ™‚ ). You can always validate it by running YourDataframe.rdd.getNumPartitions
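
As a quick illustration (PySpark, in a notebook where spark already exists; the DataFrames are hypothetical), you can see both the core-based input partitioning and the jump to spark.sql.shuffle.partitions after a join:

```
left = spark.range(1_000_000)
right = spark.range(1_000_000)

# Input partitioning: typically equal to the total number of cores,
# e.g. 8 on 4 workers x 2 cores.
print(left.rdd.getNumPartitions())

# A join shuffles, so the result picks up spark.sql.shuffle.partitions
# partitions - 200 under the classic default (newer runtimes with adaptive
# query execution enabled may coalesce this).
joined = left.join(right, "id")
print(joined.rdd.getNumPartitions())

print(spark.conf.get("spark.sql.shuffle.partitions"))
```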

There is more at the link above.

This video is also good to watch (although it is more advanced): https://www.youtube.com/watch?v=daXEp4HmS-E

Great!

@Hubert Dudek Can you tell me how to calculate how many cores I need for my data, and how many input partitions, shuffle partitions, and output partitions I need? In most cases I work with data between 500MB and 10GB.

Hubert-Dudek
Esteemed Contributor III

In such a scenario the best approach is to keep the default Spark settings. Regarding cluster size, I would enable autoscaling. 500MB to 10GB is not that big, so adding more CPUs can speed up the process, but if speed is not an issue (for example a nightly ETL on spot instances) I would stick with machines with 4 CPUs. As workers write partition data to disk, you can choose machines with SSD storage, since that disk is often the bottleneck.
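
The advice above is to trust the defaults. If you still want a rough number to sanity-check against, here is a back-of-the-envelope sketch; the ~128MB-per-partition target is a common rule of thumb, not something stated in this thread:

```
import math

def suggested_partitions(data_size_mb: float, total_cores: int,
                         target_mb: float = 128.0) -> int:
    """Rough partition count: size-based estimate, rounded up to full waves of tasks."""
    by_size = math.ceil(data_size_mb / target_mb)
    # round up to a multiple of the core count so every core gets work
    return max(total_cores, math.ceil(by_size / total_cores) * total_cores)

print(suggested_partitions(500, 8))      # 500MB -> 8 (the defaults are already fine)
print(suggested_partitions(10_000, 8))   # 10GB  -> 80
```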

Thank you!
