06-18-2021 02:12 PM
Is the best practice for tuning shuffle partitions to have the config "autoOptimizeShuffle.enabled" on? I see it is not switched on by default. Why is that?
06-21-2021 01:02 PM
AQE (enabled by default from Databricks Runtime 7.3 LTS onwards) adjusts the shuffle partition number automatically at each stage of the query, based on the size of the map-side shuffle output. So as data size grows or shrinks across stages, the task size remains roughly the same, neither too big nor too small.
However, it does not set the map-side partition number automatically today. Hence it is recommended to set the initial shuffle partition number through the SQL config spark.sql.shuffle.partitions. Now Databricks has a feature called "Auto-Optimized Shuffle" (spark.databricks.adaptive.autoOptimizeShuffle.enabled) which removes the need to set this manually. For the vast majority of use cases, enabling this auto mode should be sufficient. However, if you want to hand-tune, you can set spark.sql.shuffle.partitions manually.
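As a minimal sketch of the two approaches (the hand-tuned partition count below is purely illustrative, not a recommendation):

# AQE is on by default in Databricks Runtime 7.3 LTS+; shown here for completeness
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Option 1: let Databricks pick the initial shuffle partition number
spark.conf.set("spark.databricks.adaptive.autoOptimizeShuffle.enabled", "true")

# Option 2: hand-tune the initial shuffle partition number instead
spark.conf.set("spark.sql.shuffle.partitions", "2000")  # illustrative value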
07-02-2024 09:26 AM
I'm getting a tip in a notebook I'm running as follows:
Shuffle partition number too small: We recommend enabling Auto-Optimized Shuffle by setting 'spark.sql.shuffle.partitions=auto' or changing 'spark.sql.shuffle.partitions' to 10581 or higher.
@sajith_appukutt wrote: Now Databricks has a feature called "Auto-Optimized Shuffle" (spark.databricks.adaptive.autoOptimizeShuffle.enabled) which removes the need to set this manually. For the vast majority of use cases, enabling this auto mode should be sufficient.
So, what is the distinction between 'spark.sql.shuffle.partitions=auto', AQE, and spark.databricks.adaptive.autoOptimizeShuffle.enabled?
07-02-2024 01:42 PM - edited 07-02-2024 01:43 PM
Hi,
The autoOptimizeShuffle.enabled configuration in Databricks is designed to automatically optimize the number of shuffle partitions based on the data size and the number of available executors. This helps avoid the common problem of having too many or too few partitions, which can lead to inefficiencies.
Best practices:
Understand your workload: before relying on auto-optimization, understand the characteristics of your workload. Auto-optimization might not be the best fit for all types of workloads.
Enable auto-optimization: if you decide to use autoOptimizeShuffle.enabled, set it through the Spark configuration as follows:
spark.conf.set("spark.databricks.adaptive.autoOptimizeShuffle.enabled", "true")  # let Databricks choose the shuffle partition number
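As a sketch of the equivalent shorthand the notebook tip suggests (assuming a Databricks Runtime version that accepts the auto value for this config):

spark.conf.set("spark.sql.shuffle.partitions", "auto")  # same effect: auto-optimized shuffle picks the number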
07-03-2024 09:17 AM
That's not what I asked. I was asking what the difference is between these seemingly similar settings:
'spark.sql.shuffle.partitions=auto',
AQE, and
spark.databricks.adaptive.autoOptimizeShuffle.enabled
07-03-2024 12:47 PM
Hello,
I made two slides to explain the difference:
07-03-2024 12:56 PM
If I have a streaming table, auto wouldn't work, would it? Should I set the partitions based on the size of the stream I expect?
07-03-2024 01:11 PM
AQE applies to all queries that are:
Non-streaming
Contain at least one exchange (usually when there's a join, aggregate, or window), one sub-query, or both.
Not all AQE-applied queries are necessarily re-optimized. The re-optimization might or might not come up with a different query plan than the one statically compiled. To determine whether a query's plan has been changed by AQE, inspect the final plan in the Spark UI or EXPLAIN output; adaptively planned queries appear under an AdaptiveSparkPlan node.
But yes, you can still set spark.sql.shuffle.partitions yourself, either to auto or to a manual value. See the sketch below.
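A minimal sketch for the streaming case (the rate source and the partition count are illustrative assumptions):

# AQE does not re-optimize streaming queries, so set the shuffle partition
# number explicitly before starting the stream, sized to the expected volume.
spark.conf.set("spark.sql.shuffle.partitions", "200")  # illustrative value

stream = (
    spark.readStream.format("rate").load()   # illustrative source
         .groupBy("value").count()           # aggregation introduces a shuffle
)

stream.writeStream.format("memory").queryName("counts").outputMode("complete").start()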