
Which is quicker: grouping a table that is a join of several others or querying data?

markdias
New Contributor II

This may be a tricky question, so please bear with me.

In a real-life scenario, I have a dataframe (I'm using PySpark) called age, which is a groupBy over a join of 4 other dataframes. The join of these 4 gives me a few million rows, but after the groupBy that is reduced to some 200 rows.

I then save this dataframe to an s3 bucket.

The question now is:

What is quicker: performing a further groupBy on this dataframe directly, or reading back the data I just saved in S3 and then applying the groupBy to it?

The final goal is to save the result of this second groupBy in S3 too.

3 REPLIES

Hubert-Dudek
Esteemed Contributor III

I don't understand what you mean by a groupBy of 4 other dataframes. Can you share the code?

Usually it will be faster to process everything in one go.

Anonymous
Not applicable

Hi @Marcos Dias​ 

Hope all is well!

Was @Hubert Dudek's response able to resolve your issue? If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!

NhatHoang
Valued Contributor II

Hi @Marcos Dias,

Frankly, I think we need more detail to answer your question:

  • Is the data in these 4 dataframes updated?
  • How often do you use the grouped dataframe?