<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Do I have to run .cache() on my dataframe before returning aggregations like count? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/do-i-have-to-run-cache-on-my-dataframe-before-returning/m-p/23596#M16318</link>
    <description>&lt;P&gt;It is better to use cache() when a DataFrame is used multiple times in a single pipeline.&lt;/P&gt;&lt;P&gt;Through the cache() and persist() methods, Spark provides an optimization mechanism to store the intermediate computation of a DataFrame so it can be reused in subsequent actions.&lt;/P&gt;</description>
    <pubDate>Thu, 17 Jun 2021 14:58:40 GMT</pubDate>
    <dc:creator>Srikanth_Gupta_</dc:creator>
    <dc:date>2021-06-17T14:58:40Z</dc:date>
    <item>
      <title>Do I have to run .cache() on my dataframe before returning aggregations like count?</title>
      <link>https://community.databricks.com/t5/data-engineering/do-i-have-to-run-cache-on-my-dataframe-before-returning/m-p/23595#M16317</link>
      <description />
      <pubDate>Wed, 16 Jun 2021 23:08:38 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/do-i-have-to-run-cache-on-my-dataframe-before-returning/m-p/23595#M16317</guid>
      <dc:creator>User16826992666</dc:creator>
      <dc:date>2021-06-16T23:08:38Z</dc:date>
    </item>
    <item>
      <title>Re: Do I have to run .cache() on my dataframe before returning aggregations like count?</title>
      <link>https://community.databricks.com/t5/data-engineering/do-i-have-to-run-cache-on-my-dataframe-before-returning/m-p/23596#M16318</link>
      <description>&lt;P&gt;It is better to use cache() when a DataFrame is used multiple times in a single pipeline.&lt;/P&gt;&lt;P&gt;Through the cache() and persist() methods, Spark provides an optimization mechanism to store the intermediate computation of a DataFrame so it can be reused in subsequent actions.&lt;/P&gt;</description>
      <pubDate>Thu, 17 Jun 2021 14:58:40 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/do-i-have-to-run-cache-on-my-dataframe-before-returning/m-p/23596#M16318</guid>
      <dc:creator>Srikanth_Gupta_</dc:creator>
      <dc:date>2021-06-17T14:58:40Z</dc:date>
    </item>
    <item>
      <title>Re: Do I have to run .cache() on my dataframe before returning aggregations like count?</title>
      <link>https://community.databricks.com/t5/data-engineering/do-i-have-to-run-cache-on-my-dataframe-before-returning/m-p/23597#M16319</link>
      <description>&lt;P&gt;You do not have to cache anything to make it work. You would decide that based on whether you want to spend memory/storage to avoid recomputing the DataFrame, such as when you will use it in multiple operations afterwards.&lt;/P&gt;</description>
      <pubDate>Thu, 17 Jun 2021 18:24:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/do-i-have-to-run-cache-on-my-dataframe-before-returning/m-p/23597#M16319</guid>
      <dc:creator>sean_owen</dc:creator>
      <dc:date>2021-06-17T18:24:29Z</dc:date>
    </item>
  </channel>
</rss>

