Window function vs. groupBy + map
2 weeks ago
Let's say we have an RDD like this:
RDD(id: Int, measure: Int, date: LocalDate)
Now say we want to apply some function that compares two consecutive measures (consecutive by date), outputs a number, and then sums those numbers by id. The function is basically:
foo(measure1: Int, measure2: Int): Int
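For concreteness, assume the records and foo look something like this (the body of foo is just a placeholder, e.g. the delta between two consecutive measures; only its shape matters for the question):

```scala
import java.time.LocalDate

// The element type described above: rdd is an RDD[Record].
case class Record(id: Int, measure: Int, date: LocalDate)

// Placeholder comparison: measure1 is the earlier measure, measure2 the later one.
def foo(measure1: Int, measure2: Int): Int = measure2 - measure1
```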
Consider the following 2 solutions:
1- Use Spark SQL:
SELECT id, SUM(pair) AS total
FROM (
  SELECT id,
         foo(LAG(measure) OVER (PARTITION BY id ORDER BY date), measure) AS pair
  FROM rdd
) t
GROUP BY id
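(Spark doesn't allow a window function directly inside an aggregate, hence the subquery.) To make this runnable, foo has to be registered as a UDF and the RDD exposed as a view; a minimal sketch, assuming Spark 3.x (which can encode java.time.LocalDate):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

spark.udf.register("foo", foo _)          // make foo callable from SQL
rdd.toDF().createOrReplaceTempView("rdd") // expose the RDD as the "rdd" table

val bySql = spark.sql("""
  SELECT id, SUM(pair) AS total
  FROM (
    SELECT id,
           foo(LAG(measure) OVER (PARTITION BY id ORDER BY date), measure) AS pair
    FROM rdd
  ) t
  GROUP BY id
""")
```

Note that LAG returns NULL for the first row of each partition; since foo takes primitive Ints, Spark short-circuits the UDF to NULL there, and SUM ignores NULLs, which matches skipping the first element in solution 2 below.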
2- Use the RDD API:
rdd
  .groupBy(_.id)
  .mapValues { vals =>
    // Materialize and sort each id's records chronologically (toEpochDay gives a sortable Long).
    val sorted = vals.toSeq.sortBy(_.date.toEpochDay)
    sorted.zipWithIndex.foldLeft(0) {
      case (acc, (_, 0))          => acc // first record has no predecessor
      case (acc, (record, index)) => acc + foo(sorted(index - 1).measure, record.measure)
    }
  }
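Equivalently, the per-group fold can be written with sliding(2), which makes the consecutive-pairs logic more explicit:

```scala
rdd
  .groupBy(_.id)
  .mapValues { vals =>
    vals.toSeq.sortBy(_.date.toEpochDay)
      .sliding(2)                                            // consecutive (earlier, later) pairs
      .collect { case Seq(a, b) => foo(a.measure, b.measure) }
      .sum
  }
```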
My question is: are both solutions equivalent under the hood? In pure MapReduce terms, is there any difference between the two? I'm assuming solution 1 is literally syntactic sugar for what solution 2 is doing; is that correct?
Labels: Spark
2 weeks ago
Hi @valde, the two approaches give the same result, but they don't work the same way under the hood, so no, solution 1 is not syntactic sugar for solution 2. The Spark SQL query goes through the Catalyst optimizer: the window is computed by a sort-based Window operator that streams through each id partition without materializing whole groups, and the GROUP BY that follows reuses the same partitioning, so there is no second shuffle. The RDD version does all of this by hand: groupBy shuffles every record and materializes each id's complete group in memory before sorting it, which is generally slower, heavier on memory, and more fragile under data skew, since one hot key can blow up a single task.
Spark SQL is usually the better choice for large datasets. I would use the RDD API only when you need its granular control, e.g. to hand-roll a mitigation for severe skew, or for logic you can't express in SQL or the DataFrame API.
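For reference, the same computation in the DataFrame API stays on the optimized path without raw SQL; a sketch, assuming foo and rdd are in scope as in the question (the names fooUdf, byIdByDate, and the "total" alias are illustrative):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lag, sum, udf}

val fooUdf = udf(foo _)
val byIdByDate = Window.partitionBy("id").orderBy("date")

val result = rdd.toDF()
  .withColumn("pair", fooUdf(lag(col("measure"), 1).over(byIdByDate), col("measure")))
  .groupBy("id")
  .agg(sum("pair").as("total"))
```

You can see the difference yourself with result.explain(): the plan is a single Exchange followed by Sort, Window, and HashAggregate, whereas the groupBy version shuffles into in-memory per-key buffers with no optimizer involved.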

