Before Optimization:
Consider a query that calculates the sum of a column value
partitioned by category
without an ORDER BY
clause or a window_frame
parameter:
SELECT category, SUM(value) OVER (PARTITION BY category) AS total_value
FROM sales;
In this case, the query plan would involve a full window function execution, which can be computationally expensive.
After Optimization:
With the optimization enabled, the query can be rewritten to use an aggregate function instead, which improves performance by leveraging partial aggregation:
SELECT category, total_value
FROM (
SELECT category, SUM(value) AS total_value
FROM sales
GROUP BY category
) AS aggregated_sales;
This rewritten query avoids the overhead of running a window function by using a simple aggregation, which is more efficient.
The optimization works by rewriting eligible window functions (those without an ORDER BY
clause or a window_frame
parameter) to use aggregate functions. This change allows the query to run faster by using partial aggregation and avoiding the overhead associated with window functions. The Spark configuration parameter spark.databricks.optimizer.replaceWindowsWithAggregates.enabled
controls this optimization and is set to true
by default. To turn this optimization off, set spark.databricks.optimizer.replaceWindowsWithAggregates.enabled
to false