This may be a tricky question, so please bear with me.
In a real-life scenario I have a DataFrame (I'm using PySpark) called `age`, which is the result of joining four other DataFrames and then applying a groupBy. The join produces a few million rows, but after the groupBy this is reduced to about 200 rows.
I then save this DataFrame to an S3 bucket.
The question now is:
which is quicker: performing further groupBy operations on this DataFrame, or reading back the data I just saved to S3 and applying the groupBy to that?
The final goal is to save the result of this second groupBy to S3 as well.