Before Optimization:
Consider a query that calculates the sum of a column value partitioned by category without an ORDER BY clause or a window_frame parameter:
SELECT category, SUM(value) OVER (PARTITION BY category) AS total_value
FROM sales;
In this case, the query plan would involve a full window function execution, which can be computationally expensive.
After Optimization:
With the optimization enabled, the query can be rewritten to use an aggregate function instead, which improves performance by leveraging partial aggregation:
SELECT category, total_value
FROM (
SELECT category, SUM(value) AS total_value
FROM sales
GROUP BY category
) AS aggregated_sales;
This rewritten query avoids the overhead of running a window function by using a simple aggregation, which is more efficient.
The optimization works by rewriting eligible window functions (those without an ORDER BY clause or a window_frame parameter) to use aggregate functions. This change allows the query to run faster by using partial aggregation and avoiding the overhead associated with window functions. The Spark configuration parameter spark.databricks.optimizer.replaceWindowsWithAggregates.enabled controls this optimization and is set to true by default. To turn this optimization off, set spark.databricks.optimizer.replaceWindowsWithAggregates.enabled to false