
What are the best practices for Spark DataFrame caching?

Mado
Valued Contributor II

Hi,

When caching a DataFrame, I always use "df.cache().count()".

However, in this reference, it is suggested to save the cached DataFrame into a new variable:

  • When you cache a DataFrame, create a new variable for it: cachedDF = df.cache(). This will allow you to bypass the problems that we were solving in our example, where it is sometimes not clear what the analyzed plan is and what was actually cached. Here, whenever you call cachedDF.select(...), it will leverage the cached data.
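In code, the suggested pattern would look something like this (a minimal sketch; df stands for any existing DataFrame, and "col_a" is a placeholder column):

cachedDF = df.cache()   # keep a dedicated handle to the cached plan
cachedDF.count()        # an action to materialize the cache

# Later queries on the cached handle leverage the cached data
result = cachedDF.select("col_a")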

I don't quite understand the logic behind this, and I couldn't find similar suggestions in other articles.

My question is: what is the best practice when using caching?

5 REPLIES

-werners-
Esteemed Contributor III

The best practice when using caching is making sure the cache is actually used. That is what David explains in the article you referenced.
By assigning the cached data to a new DataFrame, you can easily view the analyzed plan, which is what is used to read from the cache.
If you do not assign the cache to a new DataFrame, this is harder to verify, as the example in the article shows, and you might think the cache is used while it is not.
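To make that concrete, here is a minimal sketch (assuming df is an existing DataFrame) of how the dedicated variable makes the check easy:

cached_df = df.cache()   # cache() is lazy; nothing is materialized yet
cached_df.count()        # an action materializes the cache

# The plan of the cached handle shows whether reads go through the cache:
# look for InMemoryTableScan in the physical plan
cached_df.explain()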

 

Mado
Valued Contributor II

@-werners- 

Thanks for your help. 

If I assign the cached DataFrame to a new variable, which one should I use in the next cells for subsequent transformations? For example,

cached_df = df.cache()
cached_df.count()

Then, which of the following options is correct for the subsequent transformations:

df2 = (df
  .join(...)
  .select(...)
  .filter(...))

or:

df2 = (cached_df
  .join(...)
  .select(...)
  .filter(...))

 

Vinay_M_R
Valued Contributor II

Hi @Mado

Whenever we want to reuse a DataFrame, we can call cache() on it; that will ensure this particular DataFrame is cached.

And also, as you mentioned, once you have cached the new DataFrame df_new, you can unpersist the earlier df after caching df_new.
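A minimal sketch of that pattern (variable names follow this thread; the select() column is a placeholder):

cached_df = df.cache()
cached_df.count()                # materialize the first cache

df2 = cached_df.select("col_a")  # downstream transformation
df2.cache()
df2.count()                      # materialize the new cache

cached_df.unpersist()            # free the earlier cache once it is no longer needed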

Lakshay
Esteemed Contributor

In addition to the other comments, I will just add: make sure you cache only when necessary. That is, if you need to keep a DataFrame around to be referenced again later in the code, you should consider caching it. But if the DataFrame is used only once, there is no need to cache it.

Unnecessary caching can add to the problem rather than helping.
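For example (an illustrative sketch, assuming a Databricks notebook where spark is the active session; the paths and column names are hypothetical):

from pyspark.sql import functions as F

events = spark.read.parquet("/data/events")

# Worth caching: events is referenced more than once below
events.cache()
events.count()  # materialize

by_country = events.groupBy("country").count()        # first reuse
active = events.filter(F.col("status") == "active")   # second reuse

# Not worth caching: used exactly once, so caching would only add overhead
ids = spark.read.parquet("/data/other").select("id")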

-werners-
Esteemed Contributor III

Totally agree on this.
