Window function VS groupBy + map

valde — Wed, 09 Apr 2025 04:53:19 GMT

Let's say we have an RDD like this:

RDD(id: Int, measure: Int, date: LocalDate)

Let's say we want to apply some function that compares 2 consecutive measures by date, outputs a number and we want to get the sum of those numbers by id. The function is basically:

foo(measure1: Int, measure2: Int): Int

Consider the following 2 solutions:

1- Use sparkSQL:

SELECT id, SUM(foo(measure, LAG(measure) OVER(PARTITION BY id ORDER BY date)))
FROM rdd
GROUP BY id

2- Use the RDD api:

rdd
.groupBy(_.id)
.mapValues{case vals =>
  val sorted = vals.sortBy(_.date)
  sorted.zipWithIndex.foldLeft(0){
    case (acc, (_, 0)) => acc
    case (acc, (record, index)) if  index > 0 =>
      acc + foo(sorted(index - 1).measure, record.measure)
  }
}

My question is: Are both solutions equivalent under the hood? In pure terms of MapReduce operations, is there any difference between both? Im assuming solution 1 is literally syntactic svgar for what solution 2 is doing, is that correct?

Re: Window function VS groupBy + map

Renu_ — Wed, 09 Apr 2025 13:25:23 GMT

Hi @valde, those two approaches give the same result, but they don’t work the same way under the hood. SparkSQL uses optimized window functions that handle things like shuffling and memory more efficiently, often making it faster and lighter.On the other hand, the RDD API does things manually, like sorting and grouping, which can be slower and more prone to issues like data skew unless you're careful.

SparkSQL is usually better for large datasets. I would say use RDDs only when handling complex skew (due to their granular control) or logic not expressible in SQL.

topic Window function VS groupBy + map in Data Engineering

Window function VS groupBy + map

Re: Window function VS groupBy + map