Ideal number and size of partitions

Anonymous
Not applicable

Spark uses 200 shuffle partitions by default for transformations. If a user is working with small data, 200 partitions can be too many, and the per-partition overhead slows the query down. Conversely, if the data is big, 200 partitions can be too few, leaving each partition oversized. So how do I figure out the ideal number and size of partitions?

The ideal partition size is expected to be between 128 MB and 1 GB. How do I go about making sure that partition sizes fall within this range?

1 ACCEPTED SOLUTION


sajith_appukutt
Honored Contributor II

You could tweak the default value of 200 by changing the spark.sql.shuffle.partitions configuration to match your data volume. Here is sample Python code for calculating the value.
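A minimal sketch of that calculation, assuming a target of roughly 128 MB per partition (the low end of the recommended range), the spark and dbutils handles available in a Databricks notebook, and a hypothetical input path used as a proxy for the shuffled data volume:

```python
# Rough sizing heuristic: pick a shuffle-partition count so that each
# partition lands near a target size. 128 MB (the low end of the
# recommended 128 MB-1 GB range) is used as the target here.
TARGET_PARTITION_BYTES = 128 * 1024 * 1024  # 128 MB

# Estimate the volume of data that will be shuffled. As a simple proxy,
# sum the sizes of the input files (the path is a hypothetical placeholder).
input_bytes = sum(f.size for f in dbutils.fs.ls("/mnt/data/my_table/"))

# Use at least one partition; otherwise one partition per ~128 MB of input.
num_partitions = max(1, input_bytes // TARGET_PARTITION_BYTES)
spark.conf.set("spark.sql.shuffle.partitions", str(num_partitions))
```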

However, if you have multiple workloads with different data volumes, instead of manually specifying the configuration for each of them, it is worth looking at AQE and auto-optimized shuffle.

AQE adjusts the shuffle partition number automatically at each stage of the query, based on the size of the map-side shuffle output. So as data size grows or shrinks across stages, the task size remains roughly the same, neither too big nor too small. However, AQE does not change the initial partition number by default, so if you are seeing spill in your jobs, you could enable auto-optimized shuffle by setting <db_prefix>.autoOptimizeShuffle.enabled to true.
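For reference, a minimal sketch of the relevant settings in a notebook. The spark.sql.adaptive.* keys are standard Spark 3.x configuration; the auto-optimized-shuffle key assumes the <db_prefix> placeholder above expands to spark.databricks.adaptive, which may differ across Databricks Runtime versions:

```python
# Enable AQE so shuffle partition counts are adjusted per stage,
# and let it coalesce small shuffle partitions automatically.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

# Auto-optimized shuffle (Databricks-specific). Assumes <db_prefix>
# expands to spark.databricks.adaptive; check your runtime's docs.
spark.conf.set("spark.databricks.adaptive.autoOptimizeShuffle.enabled", "true")
```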

More details at

https://databricks.com/blog/2020/10/21/faster-sql-adaptive-query-execution-in-databricks.html

