Window function vs. groupBy + map
2 weeks ago
Let's say we have an RDD like this:
RDD(id: Int, measure: Int, date: LocalDate)
Now say we want to apply some function that compares two consecutive measures (consecutive by date), outputs a number, and then sums those numbers by id. The function is basically:
foo(measure1: Int, measure2: Int): Int
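For concreteness, assume the records and foo look something like this (the body of foo is just a placeholder, e.g. the delta between two consecutive measures; only its shape matters for the question):

```scala
import java.time.LocalDate

// The element type described above: rdd is an RDD[Record].
case class Record(id: Int, measure: Int, date: LocalDate)

// Placeholder comparison: measure1 is the earlier measure, measure2 the later one.
def foo(measure1: Int, measure2: Int): Int = measure2 - measure1
```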
Consider the following 2 solutions:
1- Use Spark SQL:
SELECT id, SUM(pair) AS total
FROM (
  SELECT id,
         foo(LAG(measure) OVER (PARTITION BY id ORDER BY date), measure) AS pair
  FROM rdd
) t
GROUP BY id
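(Spark doesn't allow a window function directly inside an aggregate, hence the subquery.) To make this runnable, foo has to be registered as a UDF and the RDD exposed as a view; a minimal sketch, assuming Spark 3.x (which can encode java.time.LocalDate):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

spark.udf.register("foo", foo _)          // make foo callable from SQL
rdd.toDF().createOrReplaceTempView("rdd") // expose the RDD as the "rdd" table

val bySql = spark.sql("""
  SELECT id, SUM(pair) AS total
  FROM (
    SELECT id,
           foo(LAG(measure) OVER (PARTITION BY id ORDER BY date), measure) AS pair
    FROM rdd
  ) t
  GROUP BY id
""")
```

Note that LAG returns NULL for the first row of each partition; since foo takes primitive Ints, Spark short-circuits the UDF to NULL there, and SUM ignores NULLs, which matches skipping the first element in solution 2 below.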
2- Use the RDD API:
rdd
  .groupBy(_.id)
  .mapValues { vals =>
    // Materialize and sort each id's records chronologically (toEpochDay gives a sortable Long).
    val sorted = vals.toSeq.sortBy(_.date.toEpochDay)
    sorted.zipWithIndex.foldLeft(0) {
      case (acc, (_, 0))          => acc // first record has no predecessor
      case (acc, (record, index)) => acc + foo(sorted(index - 1).measure, record.measure)
    }
  }
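Equivalently, the per-group fold can be written with sliding(2), which makes the consecutive-pairs logic more explicit:

```scala
rdd
  .groupBy(_.id)
  .mapValues { vals =>
    vals.toSeq.sortBy(_.date.toEpochDay)
      .sliding(2)                                            // consecutive (earlier, later) pairs
      .collect { case Seq(a, b) => foo(a.measure, b.measure) }
      .sum
  }
```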
My question is: are both solutions equivalent under the hood? In pure MapReduce terms, is there any difference between the two? I'm assuming solution 1 is literally syntactic sugar for what solution 2 is doing; is that correct?
Labels: Spark
2 weeks ago
Hi @valde, the two approaches give the same result, but they don't work the same way under the hood, so no, solution 1 is not syntactic sugar for solution 2. The Spark SQL query goes through the Catalyst optimizer: the window is computed by a sort-based Window operator that streams through each id partition without materializing whole groups, and the GROUP BY that follows reuses the same partitioning, so there is no second shuffle. The RDD version does all of this by hand: groupBy shuffles every record and materializes each id's complete group in memory before sorting it, which is generally slower, heavier on memory, and more fragile under data skew, since one hot key can blow up a single task.
Spark SQL is usually the better choice for large datasets. I would use the RDD API only when you need its granular control, e.g. to hand-roll a mitigation for severe skew, or for logic you can't express in SQL or the DataFrame API.
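For reference, the same computation in the DataFrame API stays on the optimized path without raw SQL; a sketch, assuming foo and rdd are in scope as in the question (the names fooUdf, byIdByDate, and the "total" alias are illustrative):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lag, sum, udf}

val fooUdf = udf(foo _)
val byIdByDate = Window.partitionBy("id").orderBy("date")

val result = rdd.toDF()
  .withColumn("pair", fooUdf(lag(col("measure"), 1).over(byIdByDate), col("measure")))
  .groupBy("id")
  .agg(sum("pair").as("total"))
```

You can see the difference yourself with result.explain(): the plan is a single Exchange followed by Sort, Window, and HashAggregate, whereas the groupBy version shuffles into in-memory per-key buffers with no optimizer involved.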

