<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Window function VS groupBy + map in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/window-function-vs-groupby-map/m-p/114887#M44979</link>
    <description>&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;P&gt;Let's say we have an RDD like this:&lt;/P&gt;&lt;PRE&gt;RDD(id: Int, measure: Int, date: LocalDate)&lt;/PRE&gt;&lt;P&gt;Suppose we want to apply a function that compares two consecutive measures (ordered by date) and outputs a number, and then get the sum of those numbers by id. The function is basically:&lt;/P&gt;&lt;PRE&gt;foo(measure1: Int, measure2: Int): Int&lt;/PRE&gt;&lt;P&gt;Consider the following two solutions:&lt;/P&gt;&lt;P&gt;1- Use Spark SQL:&lt;/P&gt;&lt;PRE&gt;-- Spark rejects a window function nested inside an aggregate, so LAG is computed in a subquery
SELECT id, SUM(foo(prev_measure, measure))
FROM (SELECT id, measure, LAG(measure) OVER (PARTITION BY id ORDER BY date) AS prev_measure
      FROM rdd) AS t
GROUP BY id&lt;/PRE&gt;&lt;P&gt;2- Use the RDD API:&lt;/P&gt;&lt;PRE&gt;rdd
.groupBy(_.id)
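.mapValues { vals =&amp;gt;
  // groupBy yields an Iterable, which has no sortBy; sorting by epoch day avoids needing an Ordering[LocalDate]
  val sorted = vals.toSeq.sortBy(_.date.toEpochDay)
  // Pair each record with its successor and sum foo over the consecutive pairs (0 if fewer than two records)
  sorted.zip(sorted.drop(1)).map { case (prev, curr) =&amp;gt; foo(prev.measure, curr.measure) }.sum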
}&lt;/PRE&gt;&lt;P&gt;My question is: are the two solutions equivalent under the hood? In terms of the underlying MapReduce-style operations, is there any difference between them? I'm assuming solution 1 is just syntactic sugar for what solution 2 does; is that correct?&lt;/P&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;</description>
    <pubDate>Wed, 09 Apr 2025 04:53:19 GMT</pubDate>
    <dc:creator>valde</dc:creator>
    <dc:date>2025-04-09T04:53:19Z</dc:date>
    <item>
      <title>Window function VS groupBy + map</title>
      <link>https://community.databricks.com/t5/data-engineering/window-function-vs-groupby-map/m-p/114887#M44979</link>
      <description>&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;P&gt;Let's say we have an RDD like this:&lt;/P&gt;&lt;PRE&gt;RDD(id: Int, measure: Int, date: LocalDate)&lt;/PRE&gt;&lt;P&gt;Suppose we want to apply a function that compares two consecutive measures (ordered by date) and outputs a number, and then get the sum of those numbers by id. The function is basically:&lt;/P&gt;&lt;PRE&gt;foo(measure1: Int, measure2: Int): Int&lt;/PRE&gt;&lt;P&gt;Consider the following two solutions:&lt;/P&gt;&lt;P&gt;1- Use Spark SQL:&lt;/P&gt;&lt;PRE&gt;-- Spark rejects a window function nested inside an aggregate, so LAG is computed in a subquery
SELECT id, SUM(foo(prev_measure, measure))
FROM (SELECT id, measure, LAG(measure) OVER (PARTITION BY id ORDER BY date) AS prev_measure
      FROM rdd) AS t
GROUP BY id&lt;/PRE&gt;&lt;P&gt;2- Use the RDD API:&lt;/P&gt;&lt;PRE&gt;rdd
.groupBy(_.id)
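.mapValues { vals =&amp;gt;
  // groupBy yields an Iterable, which has no sortBy; sorting by epoch day avoids needing an Ordering[LocalDate]
  val sorted = vals.toSeq.sortBy(_.date.toEpochDay)
  // Pair each record with its successor and sum foo over the consecutive pairs (0 if fewer than two records)
  sorted.zip(sorted.drop(1)).map { case (prev, curr) =&amp;gt; foo(prev.measure, curr.measure) }.sum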
}&lt;/PRE&gt;&lt;P&gt;My question is: are the two solutions equivalent under the hood? In terms of the underlying MapReduce-style operations, is there any difference between them? I'm assuming solution 1 is just syntactic sugar for what solution 2 does; is that correct?&lt;/P&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;</description>
      <pubDate>Wed, 09 Apr 2025 04:53:19 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/window-function-vs-groupby-map/m-p/114887#M44979</guid>
      <dc:creator>valde</dc:creator>
      <dc:date>2025-04-09T04:53:19Z</dc:date>
    </item>
    <item>
      <title>Re: Window function VS groupBy + map</title>
      <link>https://community.databricks.com/t5/data-engineering/window-function-vs-groupby-map/m-p/114973#M45002</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/17480"&gt;@valde&lt;/a&gt;, those two approaches give the same result, but they don't work the same way under the hood. Spark SQL compiles the window function into an optimized physical plan: it shuffles rows by id, sorts each partition by date, and evaluates LAG over the sorted rows, which is often faster and lighter on memory. The RDD API, on the other hand, does the grouping and sorting manually: groupBy collects all values for a key in memory before you sort them, which can be slower and more prone to issues like data skew unless you're careful.&lt;/P&gt;&lt;P&gt;Spark SQL is usually the better choice for large datasets. I would use RDDs only when you need their granular control to handle complex skew, or when the logic can't be expressed in SQL.&lt;/P&gt;</description>
      <pubDate>Wed, 09 Apr 2025 13:25:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/window-function-vs-groupby-map/m-p/114973#M45002</guid>
      <dc:creator>Renu_</dc:creator>
      <dc:date>2025-04-09T13:25:23Z</dc:date>
    </item>
  </channel>
</rss>

