
Ideal number and size of partitions

Anonymous
Not applicable

Spark by default uses 200 shuffle partitions when doing transformations. For small data, 200 partitions can be too many, and the overhead of so many tiny tasks slows the query down; conversely, for big data, 200 partitions can be too few, leaving each partition oversized. So how do I figure out what the ideal partition size should be?

The ideal partition size is expected to be between 128 MB and 1 GB. How do I go about making sure that partition sizes fall in this range?

Accepted Solution

sajith_appukutt
Honored Contributor II

You could tweak the default value of 200 by changing the spark.sql.shuffle.partitions configuration to match your data volume. Here is a sample Python snippet for calculating the value.
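The original snippet was not preserved in this thread, so the following is a minimal sketch of what such a calculation might look like: it derives a shuffle partition count from an estimated shuffle data size and a target partition size of about 128 MB. The 50 GB figure and the 128 MB target are assumptions you would replace with estimates for your own workload.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumed inputs: rough size of the data being shuffled, and a target
# partition size of 128 MB (the low end of the 128 MB - 1 GB guideline).
estimated_shuffle_size_bytes = 50 * 1024 ** 3    # e.g. ~50 GB; replace with your own estimate
target_partition_size_bytes = 128 * 1024 ** 2    # 128 MB

# Don't go below the cluster's default parallelism, so every core stays busy.
num_partitions = max(
    spark.sparkContext.defaultParallelism,
    int(estimated_shuffle_size_bytes / target_partition_size_bytes),
)

spark.conf.set("spark.sql.shuffle.partitions", num_partitions)

You can estimate the shuffle size from the size of the input tables or files being joined or aggregated; the exact number matters less than keeping partitions roughly within the 128 MB to 1 GB range.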

However, if you have multiple workloads with different data volumes, instead of manually specifying the configuration for each of them, it is worth looking at AQE and Auto-Optimized Shuffle.

AQE adjusts the shuffle partition number automatically at each stage of the query, based on the size of the map-side shuffle output. So as data size grows or shrinks over different stages, the task size will remain roughly the same, neither too big nor too small. However, AQE does not change the initial partition number by default, so if you are seeing spilling in your jobs you could enable auto-optimized shuffle by setting spark.databricks.adaptive.autoOptimizeShuffle.enabled to true.
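For reference, a minimal sketch of enabling both settings from a notebook. spark.sql.adaptive.enabled is the standard Spark AQE switch; the auto-optimized shuffle key is Databricks-specific and is assumed here based on the Databricks documentation.

# Enable Adaptive Query Execution (coalesces shuffle partitions per stage).
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Databricks-specific: let the runtime also pick the initial shuffle partition number.
spark.conf.set("spark.databricks.adaptive.autoOptimizeShuffle.enabled", "true")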

More details at

https://databricks.com/blog/2020/10/21/faster-sql-adaptive-query-execution-in-databricks.html

