Databricks Community

OfirM · Monday

I have seen in the release notes of 15.3 that this was introduced and couldn't wrap my head around it.

Does someone has an example of a plan before and after?

Quote:

Performance improvement for some window functions

This release includes a change that improves the performance of some Spark window functions, specifically functions that do not include an ORDER BY clause or a window_frame parameter. In these cases, the system can rewrite the query to run it using an aggregate function. This change allows the query to run faster by using partial aggregation and avoiding the overhead of running window functions. The Spark configuration parameter spark.databricks.optimizer.replaceWindowsWithAggregates.enabled controls this optimization and is set to true by default. To turn this optimization off, set spark.databricks.optimizer.replaceWindowsWithAggregates.enabled to false.

Walter_C · Monday

Before Optimization:

Consider a query that calculates the sum of a column value partitioned by category without an ORDER BY clause or a window_frame parameter:

SELECT category, SUM(value) OVER (PARTITION BY category) AS total_value
FROM sales;

In this case, the query plan would involve a full window function execution, which can be computationally expensive.

After Optimization:

With the optimization enabled, the query can be rewritten to use an aggregate function instead, which improves performance by leveraging partial aggregation:

SELECT category, total_value
FROM (
    SELECT category, SUM(value) AS total_value
    FROM sales
    GROUP BY category
) AS aggregated_sales;

This rewritten query avoids the overhead of running a window function by using a simple aggregation, which is more efficient.

The optimization works by rewriting eligible window functions (those without an ORDER BY clause or a window_frame parameter) to use aggregate functions. This change allows the query to run faster by using partial aggregation and avoiding the overhead associated with window functions. The Spark configuration parameter spark.databricks.optimizer.replaceWindowsWithAggregates.enabled controls this optimization and is set to true by default. To turn this optimization off, set spark.databricks.optimizer.replaceWindowsWithAggregates.enabled to false