07-19-2023 12:14 AM - edited 07-19-2023 12:16 AM
Hi,
When caching a DataFrame, I always use "df.cache().count()".
However, in this reference, it is suggested to save the cached DataFrame into a new variable:
I don't fully understand the logic behind it, and I couldn't find similar suggestions in other articles.
My question is: what is the best practice when caching a DataFrame?
07-19-2023 12:28 AM
The best practice when caching is to make sure the cache is actually used. That is what David explains in the article you referenced.
By assigning the cached data to a new DataFrame, you can easily inspect its analyzed plan, which is the plan Spark uses to decide whether a query can read from the cache.
If you do not assign the cached data to a new DataFrame, this is harder to see, as the example in the article shows, and you might think the cache is used when it is not.
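A minimal sketch of that check in PySpark, not taken from the article. `plan_text` and `reads_from_cache` are hypothetical helper names, and the data in `demo` is made up. It assumes that `explain()` prints the plan to stdout and that a read from cached data shows up in the plan as an `InMemoryTableScan` (or `InMemoryRelation`) node, which may vary between Spark versions.

```python
import contextlib
import io


def plan_text(df):
    """Return the text that df.explain() prints, as a string."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        df.explain()
    return buf.getvalue()


def reads_from_cache(df):
    """Heuristic: does the printed plan contain an in-memory (cache) node?"""
    text = plan_text(df)
    return "InMemoryTableScan" in text or "InMemoryRelation" in text


def demo():
    # Imported here so the helpers above can be loaded without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1000)                       # made-up sample data
    cached_df = df.filter("id % 2 = 0").cache()  # keep a handle on the cached DataFrame
    cached_df.count()                            # action that fills the cache
    result = cached_df.filter("id > 10")         # later transformation built on cached_df
    print(reads_from_cache(result))
```

Calling `demo()` in a notebook cell and then running `result.explain()` yourself is the quickest way to see whether the later query really starts from the cached data.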
07-19-2023 04:44 AM
Thanks for your help.
If I assign the cached DataFrame to a new variable, which one should I use in the following cells for further transformations? For example,
Then, which of the following options is correct for the next transformations:
df2 = (df.join(...).select(...).filter(...))
or:
df2 = (cached_df.join(...).select(...).filter(...))
07-19-2023 12:33 AM
Hi @Mado
Whenever we want to display a DataFrame, we can call cache() on it, which ensures that this particular DataFrame is cached.
Also, as you mentioned, once you have cached the new DataFrame df_new, you can unpersist the earlier df.
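A rough sketch of that hand-over, assuming PySpark. `replace_cache` is a hypothetical helper name and the data in `demo` is made up. The new DataFrame is materialized with an action before the old cache is released, so that the new cache can still be filled from the old one.

```python
def replace_cache(old_df, new_df):
    """Cache new_df, fill its cache, then release the cache held by old_df."""
    new_df.cache()       # mark new_df for caching (lazy)
    new_df.count()       # action: forces Spark to fill the cache now
    old_df.unpersist()   # free the storage used by the earlier cache
    return new_df


def demo():
    # Imported here so the helper above can be loaded without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1000).cache()        # made-up sample data
    df.count()                            # fill the first cache
    df_new = df.filter("id % 2 = 0")      # derived DataFrame we want to keep instead
    df_new = replace_cache(df, df_new)
    print(df_new.count())
```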
07-19-2023 11:38 AM
In addition to the other comments, I will just add: make sure you cache only when necessary. That is, if you need to keep a DataFrame around for a while so that it can be referenced again later in the code, then you should consider caching it. But if the DataFrame is used only once in your code, there is no need to cache it.
Unnecessary caching can add to the problem rather than help.
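One way to keep caching limited to the part of the code that reuses a DataFrame is a small context manager. This is only a sketch, assuming PySpark; `cached` is a hypothetical helper name and the data in `demo` is made up. Since `cache()` is lazy, the first action inside the block fills the cache and later actions reuse it.

```python
from contextlib import contextmanager


@contextmanager
def cached(df):
    """Cache df for the duration of the with-block, then release it."""
    df.cache()
    try:
        yield df
    finally:
        df.unpersist()   # runs even if the block raises


def demo():
    # Imported here so the helper above can be loaded without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1000).filter("id % 3 = 0")   # made-up sample data

    with cached(df) as c:
        total = c.count()                       # first action fills the cache
        small = c.filter("id < 100").count()    # second action reuses it
    print(total, small)
```

A DataFrame that feeds only a single action gains nothing from the `with` block, which matches the advice above to skip caching in that case.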
07-20-2023 12:18 AM
Totally agree on this.