Spark data limits

BorislavBlagoev
Valued Contributor III

How much data is too much for Spark, and what is the best strategy for partitioning 2GB of data?

1 ACCEPTED SOLUTION

Hubert-Dudek
Esteemed Contributor III

2GB is quite small, so the default settings are usually best: in most cases you get a better result by not setting anything (no repartition etc.) and leaving everything to the Catalyst optimizer. If you do want to set custom partitioning, keep the following in mind (a short code sketch follows this list):

  • Remember to avoid data skew (one partition being much bigger than the others).
  • Think about avoiding data shuffles: custom partitioning can sometimes help avoid a shuffle, but it can just as often add one.
  • One core/thread works on one partition, so it makes sense to have the number of partitions equal to the number of CPUs, but Spark already sets it like that automatically.
  • When there are many data shuffles you can end up with a few hundred partitions, as the data has to be shuffled and go through memory and the network.
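
For concreteness, here is a minimal PySpark sketch of the above. The DataFrame, the path, and the column name are hypothetical; the point is only to show how to check what Spark chose, how to spot skew, and what explicit repartitioning looks like if you really need it:

```
from pyspark.sql import functions as F

# Hypothetical ~2GB DataFrame; the path is a placeholder.
df = spark.read.parquet("/mnt/example/2gb-dataset")

# How many partitions did Spark pick on its own?
print(df.rdd.getNumPartitions())

# Per-partition row counts: a quick way to spot skew
# (one partition much bigger than the others).
(df.groupBy(F.spark_partition_id().alias("partition"))
   .count()
   .orderBy("count", ascending=False)
   .show())

# Only if the defaults really do not fit - both of these trigger a shuffle.
evenly = df.repartition(8)               # fixed number of partitions
by_key = df.repartition("customer_id")   # hash-partition by a column (assumed name)
```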

There are many videos about this on the Databricks YouTube channel (some quite long).

In my opinion data skew is the most important issue and the easiest to analyze, and it can also hurt you when everything is managed automatically (you can see in the Spark UI, for example, that one partition was processed in 1 second and another in 1 minute). It typically happens when the data we read is already badly partitioned. A common example is a JDBC read where lowerBound and upperBound are set to split the ids into buckets of, say, 1000 ids per partition; over time the ids keep growing, so every partition still gets 1000 rows, but the last one, which takes all rows above upperBound, can end up with millions...
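
The JDBC case is easy to illustrate. Here is a hedged sketch of the read pattern that produces this skew; the connection URL, table, and column names are made up:

```
# Spark splits the range [lowerBound, upperBound] on the partition column into
# numPartitions strides; rows outside the bounds still get read, they just all
# land in the first or last partition.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/shop")  # placeholder URL
      .option("dbtable", "orders")                          # placeholder table
      .option("partitionColumn", "id")
      .option("lowerBound", 0)
      .option("upperBound", 100_000)   # stale bound: the table has grown far past this
      .option("numPartitions", 100)    # ~1000 ids per partition, except the last one
      .load())

# Refreshing the bounds from the source (e.g. SELECT min(id), max(id) FROM orders)
# before the read keeps the partitions balanced.
```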

8 REPLIES

@Hubert Dudek So Spark automatically partitions the data, and that depends on the CPUs. So, for example, if I have a cluster with 4 nodes and 2 cores per node, does that mean Spark will make 8 partitions for my DataFrame?

@Hubert Dudek Can you explain this one in more detail:

  • When there are many data shuffles you can end up with a few hundred partitions, as the data has to be shuffled and go through memory and the network.

Hubert-Dudek
Esteemed Contributor III

"DataFrame increases the partition number to 200 automatically when Spark operation performs data shuffling (join(), union(), aggregation functions). ."

it is from https://sparkbyexamples.com/spark/spark-shuffle-partitions/

4 workers with 2 cores in many cases will give 8 partitions (but not always ๐Ÿ™‚ ). You can always validate it by running YourDataframe.rdd.getNumPartitions
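
As a quick illustration (PySpark, in a notebook where spark already exists; the DataFrames are hypothetical), you can see both the core-based input partitioning and the jump to spark.sql.shuffle.partitions after a join:

```
left = spark.range(1_000_000)
right = spark.range(1_000_000)

# Input partitioning: typically equal to the total number of cores,
# e.g. 8 on 4 workers x 2 cores.
print(left.rdd.getNumPartitions())

# A join shuffles, so the result picks up spark.sql.shuffle.partitions
# partitions - 200 under the classic default (newer runtimes with adaptive
# query execution enabled may coalesce this).
joined = left.join(right, "id")
print(joined.rdd.getNumPartitions())

print(spark.conf.get("spark.sql.shuffle.partitions"))
```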

There is more at the link above.

This video is also good to watch (although it is more advanced): https://www.youtube.com/watch?v=daXEp4HmS-E

Great!

@Hubert Dudek Can you tell me how to calculate how many cores I need for my data, and how many input partitions, shuffle partitions, and output partitions I need? In most cases I work with data between 500MB and 10GB.

Hubert-Dudek
Esteemed Contributor III

In such a scenario the best approach is to keep the default Spark settings. Regarding cluster size, I would enable autoscaling. 500MB to 10GB is not that big, so adding more CPUs can speed up the process, but if speed is not an issue (for example a nightly ETL on spot instances) I would stick with machines with 4 CPUs. As workers write partition data to disk, you can choose machines with SSD storage, since that disk is often the bottleneck.
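
The advice above is to trust the defaults. If you still want a rough number to sanity-check against, here is a back-of-the-envelope sketch; the ~128MB-per-partition target is a common rule of thumb, not something stated in this thread:

```
import math

def suggested_partitions(data_size_mb: float, total_cores: int,
                         target_mb: float = 128.0) -> int:
    """Rough partition count: size-based estimate, rounded up to full waves of tasks."""
    by_size = math.ceil(data_size_mb / target_mb)
    # round up to a multiple of the core count so every core gets work
    return max(total_cores, math.ceil(by_size / total_cores) * total_cores)

print(suggested_partitions(500, 8))      # 500MB -> 8 (the defaults are already fine)
print(suggested_partitions(10_000, 8))   # 10GB  -> 80
```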

Thank you!
