- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
06-18-2021 02:12 PM
Is the best practice for tuning shuffle partitions to have the config "autoOptimizeShuffle.enabled" on? I see it is not switched on by default. Why is that?
- Labels:
-
Shuffle
-
Shuffle Partitions
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
06-21-2021 01:02 PM
AQE (enabled by default from 7.3 LTS + onwards) adjusts the shuffle partition number automatically at each stage of the query, based on the size of the map-side shuffle output. So as data size grows or shrinks over different stages, the task size will remain roughly the same, neither too big nor too small.
However it does not set the map-side partition number automatically today. Hence it is recommended to set initial shuffle partition number through the SQL config spark.sql.shuffle.partitions. Now Databricks has a feature to “Auto-Optimized Shuffle” ( spark.databricks.adaptive.autoOptimizeShuffle.enabled) which automates the need for setting this manually. For the vast majority of use cases, enabling this auto mode would be sufficient . However, if you want to hand tune you could set spark.sql.shuffle.partitions manually.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
06-21-2021 01:02 PM
AQE (enabled by default from 7.3 LTS + onwards) adjusts the shuffle partition number automatically at each stage of the query, based on the size of the map-side shuffle output. So as data size grows or shrinks over different stages, the task size will remain roughly the same, neither too big nor too small.
However it does not set the map-side partition number automatically today. Hence it is recommended to set initial shuffle partition number through the SQL config spark.sql.shuffle.partitions. Now Databricks has a feature to “Auto-Optimized Shuffle” ( spark.databricks.adaptive.autoOptimizeShuffle.enabled) which automates the need for setting this manually. For the vast majority of use cases, enabling this auto mode would be sufficient . However, if you want to hand tune you could set spark.sql.shuffle.partitions manually.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
07-02-2024 09:26 AM
I'm getting a tip in a notebook I'm running as follows:
Shuffle partition number too small: We recommend enabling Auto-Optimized Shuffle by setting 'spark.sql.shuffle.partitions=auto' or changing 'spark.sql.shuffle.partitions' to 10581 or higher.
@sajith_appukutt wrote:Now Databricks has a feature to “Auto-Optimized Shuffle” ( spark.databricks.adaptive.autoOptimizeShuffle.enabled) which automates the need for setting this manually. For the vast majority of use cases, enabling this auto mode would be sufficient .
So, what is the distinction between 'spark.sql.shuffle.partitions=auto', AQE, and spark.databricks.adaptive.autoOptimizeShuffle.enabled?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
07-02-2024 01:42 PM - edited 07-02-2024 01:43 PM
Hi,
autoOptimizeShuffle.enabled The autoOptimizeShuffle.enabled configuration in Databricks is designed to automatically optimize the number of shuffle partitions based on the data size and the number of available executors. This can help avoid the common issue of having too many or too few partitions, which can lead to inefficiencies. Best Practices Understand Your Workload: Before relying on auto-optimization, understand the characteristics of your workload. Auto-optimization might not be the best for all types of workloads. Enable Auto Optimization: If you decide to use autoOptimizeShuffle.enabled, enable it through the Databricks configuration. This can be done as follows:
spark.conf.set("spark.databricks.optimizer.autoOptimizeShuffle.enabled", "true")
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
07-03-2024 09:17 AM
That's not what I asked. I was asking what is the difference between the various seemingly similar settings of:
'spark.sql.shuffle.partitions=auto',
AQE, and
spark.databricks.adaptive.autoOptimizeShuffle.enabled
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
07-03-2024 12:47 PM
Hello,
I did two slide to explain the difference :
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
07-03-2024 12:56 PM
if I have a streaming table, auto wouldn't work would it? I should set the partitions based on the size of the stream I expect?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
07-03-2024 01:11 PM
AQE applies to all queries that are:
Non-streaming
Contain at least one exchange (usually when there’s a join, aggregate, or window), one sub-query, or both.
Not all AQE-applied queries are necessarily re-optimized. The re-optimization might or might not come up with a different query plan than the one statically compiled. To determine whether a query’s plan has been changed by AQE
But you can use spark.sql.shuffle.partitions in auto or using manual settings