Hi @valde, those two approaches give the same result, but they don't work the same way under the hood. SparkSQL uses optimized window functions that handle shuffling and memory management for you, which often makes it faster and lighter. The RDD API, on the other hand, leaves steps like sorting and grouping to you, which can be slower and more prone to issues like data skew unless you're careful.
SparkSQL is usually the better choice for large datasets. I would use RDDs only when you need their granular control to handle complex skew, or when the logic can't be expressed in SQL.