Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Tuning shuffle partitions

Anonymous
Not applicable

Is the best practice for tuning shuffle partitions to have the config "autoOptimizeShuffle.enabled" on? I see it is not switched on by default. Why is that?

1 ACCEPTED SOLUTION


sajith_appukutt
Honored Contributor II

AQE (enabled by default from DBR 7.3 LTS onwards) adjusts the shuffle partition number automatically at each stage of the query, based on the size of the map-side shuffle output. So as data size grows or shrinks over different stages, the task size remains roughly the same, neither too big nor too small.

However, AQE does not set the map-side partition number automatically today, so it is recommended to set the initial shuffle partition number through the SQL config spark.sql.shuffle.partitions. Databricks also has an "Auto-Optimized Shuffle" feature (spark.databricks.adaptive.autoOptimizeShuffle.enabled) which removes the need to set this manually. For the vast majority of use cases, enabling this auto mode is sufficient. However, if you want to hand-tune, you can still set spark.sql.shuffle.partitions manually.
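In PySpark, the two approaches described above can be set like this (a configuration sketch; the config names follow this thread, and availability and exact behavior depend on your Databricks Runtime version):

```python
# Option 1: hand-tune the initial shuffle partition number.
# A common starting point is total shuffle data / target task size (~128-200 MB).
spark.conf.set("spark.sql.shuffle.partitions", "2000")

# Option 2: let Databricks pick the number automatically (Auto-Optimized Shuffle).
spark.conf.set("spark.databricks.adaptive.autoOptimizeShuffle.enabled", "true")
```

Here `spark` is the SparkSession that Databricks notebooks provide; the value 2000 is only a placeholder to illustrate the manual route.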


7 REPLIES


I'm getting a tip in a notebook I'm running as follows:

Shuffle partition number too small: We recommend enabling Auto-Optimized Shuffle by setting 'spark.sql.shuffle.partitions=auto' or changing 'spark.sql.shuffle.partitions' to 10581 or higher.
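Recommendations like that one generally come from a sizing heuristic: divide the estimated shuffle size by a target per-task size. A rough back-of-envelope helper (the 128 MiB target and the helper itself are illustrative assumptions, not the formula Databricks actually uses):

```python
import math

def recommend_shuffle_partitions(shuffle_bytes, target_bytes_per_task=128 * 1024**2):
    """Round estimated shuffle size / target task size up to a whole partition count."""
    return max(1, math.ceil(shuffle_bytes / target_bytes_per_task))

# e.g. ~1 TiB of shuffle data at 128 MiB per task:
print(recommend_shuffle_partitions(1024**4))  # 8192
```

A number like 10581 suggests the notebook's shuffle stage was moving on the order of a terabyte or more with a similar per-task target.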



@sajith_appukutt wrote:

Now Databricks has a feature to "Auto-Optimized Shuffle" (spark.databricks.adaptive.autoOptimizeShuffle.enabled) which automates the need for setting this manually. For the vast majority of use cases, enabling this auto mode would be sufficient.


So, what is the distinction between 'spark.sql.shuffle.partitions=auto', AQE, and spark.databricks.adaptive.autoOptimizeShuffle.enabled?

mtajmouati
New Contributor II

Hi,

The autoOptimizeShuffle.enabled configuration in Databricks is designed to automatically optimize the number of shuffle partitions based on the data size and the number of available executors. This helps avoid the common problem of having too many or too few partitions, which leads to inefficiencies.

Best practices:

  • Understand your workload: auto-optimization might not be the best fit for every workload, so profile before relying on it.

  • Enable auto-optimization: if you decide to use it, set it through the Spark configuration:

spark.conf.set("spark.databricks.adaptive.autoOptimizeShuffle.enabled", "true")

Mehdi TAJMOUATI
https://www.wytasoft.com/wytasoft-group/

lprevost
New Contributor III

That's not what I asked. I was asking what the difference is between these seemingly similar settings:

'spark.sql.shuffle.partitions=auto',

AQE, and 

spark.databricks.adaptive.autoOptimizeShuffle.enabled

mtajmouati
New Contributor II

Hello,

I made two slides to explain the difference:

[Slide 1: mtajmouati_0-1720035744520.png]

[Slide 2: mtajmouati_1-1720035892865.png]
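Since the slides are images, here is the same comparison in config form as I understand it from this thread and the Databricks docs (a hedged sketch; exact names and behavior vary by Databricks Runtime version):

```python
# AQE: the open-source adaptive framework; coalesces reducer-side shuffle
# partitions at runtime. On by default since DBR 7.3 LTS.
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Auto-Optimized Shuffle (Databricks-specific): additionally picks the
# *initial* shuffle partition number automatically, which plain AQE does not.
spark.conf.set("spark.databricks.adaptive.autoOptimizeShuffle.enabled", "true")

# Newer runtimes accept "auto" as shorthand that enables the same feature.
spark.conf.set("spark.sql.shuffle.partitions", "auto")
```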

 

Mehdi TAJMOUATI
https://www.wytasoft.com/wytasoft-group/

lprevost
New Contributor III

If I have a streaming table, auto wouldn't work, would it? Should I set the partitions based on the size of the stream I expect?

mtajmouati
New Contributor II

AQE applies to all queries that are:

  • Non-streaming

  • Contain at least one exchange (usually when there's a join, aggregate, or window), one sub-query, or both.

Not all AQE-applied queries are necessarily re-optimized; the re-optimization might or might not come up with a different query plan than the one statically compiled. To determine whether a query's plan has been changed by AQE, inspect the plan: with AQE the top node is AdaptiveSparkPlan, and isFinalPlan=true indicates that re-optimization has completed.

So for a streaming table AQE will not apply, but you can still set spark.sql.shuffle.partitions yourself, either to auto or to a manual value.
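For a Structured Streaming job, a manual setting is typically applied before the stream starts, since the shuffle partition count is pinned in the checkpoint once the query begins. A sketch (the table names and checkpoint path are hypothetical placeholders):

```python
# Set the partition count before starting the stream; once a checkpoint
# exists, this value is fixed for that query.
spark.conf.set("spark.sql.shuffle.partitions", "200")  # size to expected micro-batch volume

(spark.readStream.table("events")            # "events" is a hypothetical source table
      .groupBy("user_id").count()            # the aggregation forces a shuffle
      .writeStream
      .option("checkpointLocation", "/tmp/checkpoints/events_agg")  # hypothetical path
      .outputMode("complete")
      .toTable("events_agg"))                # hypothetical sink table
```

To change the partition count later, you would start the query with a new checkpoint location.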

Mehdi TAJMOUATI
https://www.wytasoft.com/wytasoft-group/