
What are the best practices for Spark DataFrame caching?

Mado
Valued Contributor II

Hi,

When caching a DataFrame, I always use "df.cache().count()".

However, in this reference, it is suggested to save the cached DataFrame into a new variable:

  • When you cache a DataFrame, create a new variable for it: cachedDF = df.cache(). This will allow you to bypass the problems that we were solving in our example, where it is sometimes not clear what the analyzed plan is and what was actually cached. Here, whenever you call cachedDF.select(...), it will leverage the cached data.
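In code, the suggested pattern would look something like this (a minimal sketch; df stands for any existing DataFrame, and "col_a" is a placeholder column):

cachedDF = df.cache()   # keep a dedicated handle to the cached plan
cachedDF.count()        # an action to materialize the cache

# Later queries on the cached handle leverage the cached data
result = cachedDF.select("col_a")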

I don't quite understand the logic behind this, and I couldn't find similar suggestions in other articles.

My question is: what is the best practice when using caching?

5 REPLIES

-werners-
Esteemed Contributor III

The best practice when using caching is making sure the cache is actually used. That is what David explains in the article you referenced.
By assigning the cached data to a new DataFrame, you can easily view the analyzed plan, which is what is used to read from the cache.
If you do not assign the cache to a new DataFrame, this is harder to verify, as the example in the article shows, and you might think the cache is used while it is not.
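To make that concrete, here is a minimal sketch (assuming df is an existing DataFrame) of how the dedicated variable makes the check easy:

cached_df = df.cache()   # cache() is lazy; nothing is materialized yet
cached_df.count()        # an action materializes the cache

# The plan of the cached handle shows whether reads go through the cache:
# look for InMemoryTableScan in the physical plan
cached_df.explain()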

 

Mado
Valued Contributor II

@-werners- 

Thanks for your help. 

If I assign the cached DataFrame to a new variable, which one should I use in the next cells for subsequent transformations? For example,

cached_df = df.cache()
cached_df.count()

Then, which of the following options is correct for the subsequent transformations:

df2 = (df
  .join(...)
  .select(...)
  .filter(...))

or:

df2 = (cached_df
  .join(...)
  .select(...)
  .filter(...))

 

Vinay_M_R
Valued Contributor II

Hi @Mado

Whenever we want to reuse a DataFrame, we can call cache() on it; that will ensure this particular DataFrame is cached.

And also, as you mentioned, once you have cached the new DataFrame df_new, you can unpersist the earlier df after caching df_new.
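A minimal sketch of that pattern (variable names follow this thread; the select() column is a placeholder):

cached_df = df.cache()
cached_df.count()                # materialize the first cache

df2 = cached_df.select("col_a")  # downstream transformation
df2.cache()
df2.count()                      # materialize the new cache

cached_df.unpersist()            # free the earlier cache once it is no longer needed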

Lakshay
Esteemed Contributor

In addition to the other comments, I will just add: make sure you cache only when necessary. That is, if you need to keep a DataFrame around to be referenced again later in the code, you should consider caching it. But if the DataFrame is used only once, there is no need to cache it.

Unnecessary caching can add to the problem rather than helping.
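For example (an illustrative sketch, assuming a Databricks notebook where spark is the active session; the paths and column names are hypothetical):

from pyspark.sql import functions as F

events = spark.read.parquet("/data/events")

# Worth caching: events is referenced more than once below
events.cache()
events.count()  # materialize

by_country = events.groupBy("country").count()        # first reuse
active = events.filter(F.col("status") == "active")   # second reuse

# Not worth caching: used exactly once, so caching would only add overhead
ids = spark.read.parquet("/data/other").select("id")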

-werners-
Esteemed Contributor III

Totally agree on this.
