01-28-2022 04:02 AM
01-28-2022 05:02 AM
2 GB is quite small, so the default settings are usually best (in most cases you get a better result by not setting anything like repartition and leaving everything to the Catalyst optimizer). If you want to set custom partitioning:
There are many videos on the Databricks YouTube channel about that (some really long).
In my opinion the most important and easiest issue to analyze is data skew, and it can also affect the situation when everything is managed automatically (you can see it in the Spark UI when, for example, one partition was processed in 1 second and another in 1 minute). It can happen when the data we read is badly partitioned. A common problem is that for a SQL JDBC read people set lowerBound and upperBound to divide ids into buckets of, say, 1000 ids per partition, but over time the ids keep growing, which creates skew: every partition gets 1000 rows, while the last one, which takes all rows above upperBound, can have millions.
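A minimal sketch of that kind of partitioned JDBC read, assuming an existing `spark` session as on Databricks; the connection URL, table, column and credentials are hypothetical:

```python
# Hypothetical partitioned JDBC read that can produce the skew described above.
# Spark splits the range [lowerBound, upperBound] into numPartitions buckets;
# the bounds do not filter rows, so everything above upperBound piles into the
# last partition once ids keep growing past a stale bound.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/shop")   # hypothetical connection
    .option("dbtable", "orders")                            # hypothetical table
    .option("user", "reader")                               # hypothetical credentials
    .option("password", "...")
    .option("partitionColumn", "id")
    .option("lowerBound", "1")
    .option("upperBound", "1000000")   # stale upper bound -> one huge last partition
    .option("numPartitions", "1000")
    .load()
)
```

Keeping the bounds up to date (or partitioning on a column with a more even distribution) avoids the one heavy last partition.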
01-28-2022 06:11 AM
@Hubert Dudek So Spark automatically partitions the data, and that depends on the CPUs? So, for example, if I have a cluster with 4 nodes and 2 cores per node, does that mean Spark will make 8 partitions for my DataFrame?
01-28-2022 06:15 AM
@Hubert Dudek Can you explain more about this:
01-28-2022 07:06 AM
"DataFrame increases the partition number to 200 automatically when Spark operation performs data shuffling (join(), union(), aggregation functions). ."
it is from https://sparkbyexamples.com/spark/spark-shuffle-partitions/
4 workers with 2 cores will in many cases give 8 partitions (but not always 🙂). You can always validate it by running YourDataframe.rdd.getNumPartitions (see the sketch after this reply).
There is more on the mentioned link.
This video is also good to watch (although it is more advanced): https://www.youtube.com/watch?v=daXEp4HmS-E
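A rough illustration of both points, assuming an existing `spark` session; the exact counts depend on cluster size and on adaptive query execution:

```python
# Input partitions: usually one per available core for a parallelized range or read.
df = spark.range(0, 1_000_000)
print(df.rdd.getNumPartitions())      # e.g. 8 on 4 workers x 2 cores (but not always)

# After a shuffle (here an aggregation) the partition count comes from
# spark.sql.shuffle.partitions, which defaults to 200; adaptive query execution,
# on by default in recent runtimes, may coalesce it further.
agg = df.groupBy((df.id % 10).alias("bucket")).count()
print(agg.rdd.getNumPartitions())

# The shuffle partition setting itself can be inspected (or changed) here.
print(spark.conf.get("spark.sql.shuffle.partitions"))
```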
01-28-2022 08:28 AM
Great!
01-28-2022 11:04 AM
@Hubert Dudek Can you tell me how to calculate how many cores I need for my data, and how many input partitions, shuffle partitions, and output partitions I need? In most cases, I work with data between 500 MB and 10 GB.
02-01-2022 02:19 AM
In such a scenario the best option is to keep the default Spark settings. Regarding cluster size, I would enable autoscaling. 500 MB to 10 GB is not that big, so adding more CPUs can speed the process up, but if speed is not an issue (for example, a nightly ETL on spot instances) I would stick with a machine with 4 CPUs. Since workers write partition data to disk, you can choose a machine with SSD storage, as disk is often the bottleneck.
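A minimal sketch of that default-settings approach, only reducing the number of output files at the end; the paths and column name are hypothetical:

```python
# Read with defaults: Spark picks the input partition count from the file layout.
df = spark.read.parquet("/mnt/raw/events")        # hypothetical input path
print(df.rdd.getNumPartitions())                  # only worth checking if a job looks slow

# Shuffle partitioning stays at the defaults (spark.sql.shuffle.partitions / AQE).
daily = df.groupBy("event_date").count()          # hypothetical column

# For a small result, coalesce before writing to avoid many tiny output files;
# otherwise leave the output partitioning to Spark as well.
(daily.coalesce(8)
      .write.mode("overwrite")
      .parquet("/mnt/curated/daily_counts"))      # hypothetical output path
```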
02-01-2022 02:55 AM
Thank you!