01-28-2022 04:02 AM
Accepted Solutions
01-28-2022 05:02 AM
2GB is quite small, so the default settings are usually best (in most cases you get a better result by not setting anything like repartition and leaving everything to the Catalyst optimizer). If you do want to set custom partitioning:
- remember to avoid data skew (where one partition is much bigger than the others),
- think about avoiding data shuffles - custom partitioning can sometimes help avoid a shuffle, but it can also often increase shuffling,
- 1 core/thread works on 1 partition, so it makes sense to have number of partitions = number of CPUs, but that is already set automatically,
- when there are many data shuffles you can end up with a few hundred partitions, as the data has to be shuffled and go through memory and the network.
There are many videos on the Databricks YouTube channel about this (some quite long).
In my opinion, data skew is the most important and easiest issue to analyze, and it can also show up when everything is managed automatically (you can see it in the Spark UI when, for example, one partition was processed in 1 second and another in 1 minute). It can happen when you read data that is badly partitioned. A common problem is that people reading over SQL/JDBC set lowerBound and upperBound to split the ids into buckets, for example 1000 ids per partition, but over time the ids keep growing, which creates data skew: every partition gets 1000 rows, but the last one takes all rows above upperBound and can end up with millions...
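To make that lowerBound/upperBound pitfall concrete, here is a minimal PySpark sketch; the connection URL, table name, and column names are placeholders for illustration, not details from this thread:

```python
# Sketch of the JDBC partitioning pitfall described above.
# The URL, table and column names are placeholders / assumptions.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/shop")  # placeholder connection
      .option("dbtable", "orders")                           # placeholder table
      .option("partitionColumn", "id")                       # numeric column used to split the read
      .option("lowerBound", 1)                               # fixed when the job was written
      .option("upperBound", 100000)                          # but the table keeps growing past this
      .option("numPartitions", 100)                          # ~1000 ids per partition...
      .load())
# ...except the last partition, which also receives every row with id >= upperBound
# and can end up with millions of rows -> the data skew described above.

print(df.rdd.getNumPartitions())  # check how the read was actually split

# If you do repartition manually, spread rows evenly, e.g. by a high-cardinality key:
df_even = df.repartition(8, "id")  # 8 = total number of cores in the example cluster
```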
01-28-2022 06:11 AM
@Hubert Dudek So Spark partitions the data automatically, and that depends on the CPUs? For example, if I have a cluster with 4 nodes and 2 cores per node, does that mean Spark will make 8 partitions for my DataFrame?
01-28-2022 06:15 AM
@Hubert Dudek Can you explain this point in more detail:
- when there are many data shuffles you can end up with a few hundred partitions, as the data has to be shuffled and go through memory and the network
01-28-2022 07:06 AM
"DataFrame increases the partition number to 200 automatically when Spark operation performs data shuffling (join(), union(), aggregation functions). ."
it is from https://sparkbyexamples.com/spark/spark-shuffle-partitions/
4 workers with 2 cores in many cases will give 8 partitions (but not always 🙂 ). You can always validate it by running YourDataframe.rdd.getNumPartitions
There is more on mentioned link.
That video is also good to watch (although it is more advanced) https://www.youtube.com/watch?v=daXEp4HmS-E
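A quick way to see both numbers from a notebook; this is only an illustrative sketch (the 200 comes from the spark.sql.shuffle.partitions default, and adaptive query execution may coalesce it at runtime):

```python
# Default number of partitions produced after a shuffle (200 unless changed)
print(spark.conf.get("spark.sql.shuffle.partitions"))  # -> '200'

# A freshly created/read DataFrame usually follows input splits / available cores
df = spark.range(0, 1_000_000)      # illustrative DataFrame
print(df.rdd.getNumPartitions())    # often == total cores, e.g. 8 on 4 workers x 2 cores

# After a wide transformation (groupBy triggers a shuffle) the partition count
# follows spark.sql.shuffle.partitions instead (AQE may reduce it at runtime)
agg = df.groupBy((df.id % 10).alias("bucket")).count()
print(agg.rdd.getNumPartitions())
```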
01-28-2022 08:28 AM
Great!
01-28-2022 11:04 AM
@Hubert Dudek Can you tell me how to calculate how many cores I need for my data, and how many input partitions, shuffle partitions, and output partitions I need? In most cases, I work with data between 500MB and 10GB.
02-01-2022 02:19 AM
In such a scenario the best option is to use the default Spark settings. Regarding cluster size, I would enable autoscaling. 500MB to 10GB is not that big, so adding more CPUs can speed up the process, but if speed is not an issue (for example a nightly ETL on spot instances) I would stick with a machine with 4 CPUs. Since workers write partition data to disk, you can choose machines with SSD storage, as disk is often the bottleneck.
02-01-2022 02:55 AM
Thank you!

