Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Are window functions more performant than self joins?

matthewe97
New Contributor

I have a table with data for each month end and want to know the LEAD and LAG data points either side of each month. For example:

SELECT month_date,
       LEAD(month_date) OVER (PARTITION BY id ORDER BY month_date) AS next_month_date,
       LAG(month_date) OVER (PARTITION BY id ORDER BY month_date) AS previous_month_date
FROM t

I want to know if this is better than:

SELECT t.month_date,
       t2.month_date AS next_month_date,
       t3.month_date AS previous_month_date
FROM t
LEFT JOIN t t2 ON t.id = t2.id AND t2.month_date = ADD_MONTHS(t.month_date, 1)
LEFT JOIN t t3 ON t.id = t3.id AND t3.month_date = ADD_MONTHS(t.month_date, -1)
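Before comparing performance, it is worth confirming the two formulations are actually equivalent. A minimal sketch of that check, using Python's stdlib sqlite3 as a stand-in for Spark SQL (ADD_MONTHS does not exist in SQLite, so it is emulated with date(..., '+1 month'); the toy table, ids, and month-start dates are assumptions for illustration):

```python
# Sketch: verify the window-function and self-join formulations return the
# same rows. SQLite stands in for Spark SQL; ADD_MONTHS is emulated with
# date(..., '+N month'). Month-start dates keep the date arithmetic exact.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (id INTEGER, month_date TEXT);
INSERT INTO t VALUES
  (1, '2023-01-01'), (1, '2023-02-01'), (1, '2023-03-01'),
  (2, '2023-01-01'), (2, '2023-02-01');
""")

window_sql = """
SELECT id, month_date,
       LEAD(month_date) OVER (PARTITION BY id ORDER BY month_date) AS next_month_date,
       LAG(month_date)  OVER (PARTITION BY id ORDER BY month_date) AS previous_month_date
FROM t
ORDER BY id, month_date
"""

join_sql = """
SELECT t.id, t.month_date,
       t2.month_date AS next_month_date,
       t3.month_date AS previous_month_date
FROM t
LEFT JOIN t t2 ON t.id = t2.id AND t2.month_date = date(t.month_date, '+1 month')
LEFT JOIN t t3 ON t.id = t3.id AND t3.month_date = date(t.month_date, '-1 month')
ORDER BY t.id, t.month_date
"""

window_rows = conn.execute(window_sql).fetchall()
join_rows = conn.execute(join_sql).fetchall()
print(window_rows == join_rows)  # the two formulations agree on this data
```

Note the self-join only matches rows that are exactly one month apart, so a missing month yields NULL neighbors, just as LEAD/LAG over a gap would return the next available row instead — on data with gaps the two queries can differ.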

1 ACCEPTED SOLUTION

pvignesh92
Honored Contributor

Hi @Matthew Elsham, as @Lakshay Goel pointed out, I would expect the window functions to work better: Spark first partitions the data by the partition key, and the LEAD/LAG computation then happens within each partition on a single worker. That said, it is worth looking at the query plans for both versions to understand what works best with your data.

One good blog post I found on this is https://blog.knoldus.com/using-windows-in-spark-to-avoid-joins/. Please have a look.
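The advice to compare plans can be sketched even outside Databricks. Here SQLite's EXPLAIN QUERY PLAN stands in for Spark SQL's EXPLAIN (or DataFrame.explain()), and the toy schema is an assumption mirroring the question:

```python
# Hedged sketch of "look at the plan for both cases": SQLite's
# EXPLAIN QUERY PLAN stands in for Spark SQL's EXPLAIN here.
# On Databricks you would prefix the real queries with EXPLAIN instead.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, month_date TEXT)")

window_plan = conn.execute("""
EXPLAIN QUERY PLAN
SELECT LEAD(month_date) OVER (PARTITION BY id ORDER BY month_date)
FROM t
""").fetchall()

join_plan = conn.execute("""
EXPLAIN QUERY PLAN
SELECT t.month_date, t2.month_date
FROM t LEFT JOIN t t2
  ON t.id = t2.id AND t2.month_date = date(t.month_date, '+1 month')
""").fetchall()

# Typically the window plan scans t once, while the self-join plan
# shows a scan (or probe) per table reference. Inspect the rows:
for row in window_plan + join_plan:
    print(row)
```

On Spark the plan details differ (Window and Sort nodes versus SortMergeJoin or BroadcastHashJoin nodes), but the same habit applies: run EXPLAIN on both queries and compare the number of scans and shuffles.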


3 REPLIES

Lakshay
Esteemed Contributor

Hi @Matthew Elsham, in this case I would expect the window functions to do better, because the second query performs two joins.

That said, join performance depends heavily on the join strategy chosen and the volume of data being joined. Either way, window functions give you much cleaner code to work with.


Anonymous
Not applicable

Hi @Matthew Elsham,

Thank you for posting your question in our community! We are happy to assist you.

To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?

This will also help other community members who may have similar questions in the future. Thank you for your participation and let us know if you need any further assistance! 
