This may be a tricky question, so please bear with me.
In a real-life scenario I have a DataFrame (I'm using PySpark) called `age`, which is the result of joining four other DataFrames and then applying a groupBy. The join produces a few million rows, but after the groupBy this is reduced to about 200 rows.
I then save this DataFrame to an S3 bucket.
The question now is:
which is quicker: performing further groupBy operations on this DataFrame, or reading back the data I just saved to S3 and applying the groupBy to that?
The final goal is to save the result of this second groupBy to S3 as well.