
Tuning shuffle partitions

Anonymous
Not applicable

Is the best practice for tuning shuffle partitions to have the config "autoOptimizeShuffle.enabled" on? I see it is not switched on by default. Why is that?

1 ACCEPTED SOLUTION


sajith_appukutt
Honored Contributor II

AQE (enabled by default from DBR 7.3 LTS onwards) adjusts the shuffle partition number automatically at each stage of the query, based on the size of the map-side shuffle output. So as data size grows or shrinks across stages, the task size remains roughly the same, neither too big nor too small.

However, it does not set the map-side partition number automatically today. Hence it is recommended to set the initial shuffle partition number through the SQL config spark.sql.shuffle.partitions. Databricks also has an "Auto-Optimized Shuffle" feature (spark.databricks.adaptive.autoOptimizeShuffle.enabled) which removes the need to set this manually. For the vast majority of use cases, enabling this auto mode is sufficient. However, if you want to hand-tune, you can still set spark.sql.shuffle.partitions yourself.
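For illustration, the two approaches described here might look like this in a notebook session (a minimal sketch; the value 2000 is an arbitrary example, not a recommendation):

# Option A: let Databricks Auto-Optimized Shuffle pick the shuffle partition number
spark.conf.set("spark.databricks.adaptive.autoOptimizeShuffle.enabled", "true")

# Option B: hand-tune the initial shuffle partition number; AQE will still
# coalesce partitions at each downstream stage based on map-side output size
spark.conf.set("spark.sql.shuffle.partitions", "2000")  # example value only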


7 REPLIES


I'm getting a tip in a notebook I'm running as follows:

Shuffle partition number too small: We recommend enabling Auto-Optimized Shuffle by setting 'spark.sql.shuffle.partitions=auto' or changing 'spark.sql.shuffle.partitions' to 10581 or higher.

 


@sajith_appukutt wrote:

Databricks also has an "Auto-Optimized Shuffle" feature (spark.databricks.adaptive.autoOptimizeShuffle.enabled) which removes the need to set this manually. For the vast majority of use cases, enabling this auto mode is sufficient.


So, what is the distinction between 'spark.sql.shuffle.partitions=auto', AQE, and spark.databricks.adaptive.autoOptimizeShuffle.enabled?

mtajmouati
Contributor

Hi,

The autoOptimizeShuffle.enabled configuration in Databricks is designed to automatically optimize the number of shuffle partitions based on the data size and the number of available executors. This can help avoid the common issue of having too many or too few partitions, which can lead to inefficiencies.

Best practices:

  • Understand your workload: before relying on auto-optimization, understand the characteristics of your workload. Auto-optimization might not be the best fit for all types of workloads.

  • Enable auto-optimization: if you decide to use autoOptimizeShuffle.enabled, enable it through the Spark configuration, as follows:

 

spark.conf.set("spark.databricks.optimizer.autoOptimizeShuffle.enabled", "true")

 

 

That's not what I asked. I was asking what the difference is between these seemingly similar settings:

  • spark.sql.shuffle.partitions=auto

  • AQE

  • spark.databricks.adaptive.autoOptimizeShuffle.enabled

mtajmouati
Contributor

Hello,

I made two slides to explain the difference:

[two explanatory slides attached as images]
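Since the slide images don't carry over here, a rough summary of how the three settings relate, sketched as session configs (based on the Spark and Databricks docs rather than the slides themselves):

# AQE: the open-source adaptive framework, on by default since DBR 7.3 LTS.
# At runtime it can coalesce the shuffle partitions each stage actually uses.
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Auto-Optimized Shuffle: a Databricks-specific feature on top of AQE that
# also chooses the initial shuffle partition number, which plain AQE does not.
spark.conf.set("spark.databricks.adaptive.autoOptimizeShuffle.enabled", "true")

# Newer spelling of the same Databricks feature, as the notebook tip quoted above suggests.
spark.conf.set("spark.sql.shuffle.partitions", "auto")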

 

If I have a streaming table, auto wouldn't work, would it? Should I set the partitions based on the size of the stream I expect?

mtajmouati
Contributor

AQE applies to all queries that:

  • Are non-streaming

  • Contain at least one exchange (usually when there's a join, aggregate, or window), at least one subquery, or both

Not all AQE-applied queries are necessarily re-optimized; the re-optimization may or may not produce a different query plan than the one statically compiled. To determine whether a query's plan has been changed by AQE, inspect the final plan, which appears as an AdaptiveSparkPlan node in the Spark UI.

But for streaming you can still set spark.sql.shuffle.partitions yourself, either to auto or to a manually tuned value.
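To make the streaming case concrete, a minimal sketch (the table names and checkpoint path are hypothetical): since AQE won't re-optimize a streaming query, you size the shuffle partitions yourself before starting the stream, for example from the expected micro-batch volume and the cluster's core count:

# AQE does not apply to streaming queries, so fix the value up front
spark.conf.set("spark.sql.shuffle.partitions", "64")  # example value only

(
    spark.readStream.table("events")          # hypothetical source table
        .groupBy("user_id").count()           # the aggregation forces a shuffle
        .writeStream
        .option("checkpointLocation", "/tmp/checkpoints/events_agg")
        .outputMode("complete")
        .toTable("event_counts")              # hypothetical target table
)

Note that for stateful streaming the shuffle partition count is baked into the checkpoint on the first run, so changing it later requires a new checkpoint.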
