Data Engineering
Do I have to run .cache() on my DataFrame before returning aggregations like count()?

User16826992666
Valued Contributor
 
2 REPLIES

Srikanth_Gupta_
Valued Contributor

It's better to use cache() when a DataFrame is used multiple times in a single pipeline.

Through the cache() and persist() methods, Spark provides an optimization mechanism to store the intermediate computation of a DataFrame so it can be reused in subsequent actions.

sean_owen
Honored Contributor II

You do not have to cache anything to make it work. You would decide that based on whether you want to spend memory/storage to avoid recomputing the DataFrame, for example when you will use it in multiple operations afterwards.
