
What are the best practices for Spark DataFrame caching?

Mado
Valued Contributor II

Hi,

When caching a DataFrame, I always use "df.cache().count()".

However, in this reference, it is suggested to save the cached DataFrame into a new variable:

  • When you cache a DataFrame create a new variable for it cachedDF = df.cache(). This will allow you to bypass the problems that we were solving in our example, that sometimes it is not clear what is the analyzed plan and what was actually cached. Here whenever you call cachedDF.select(…) it will leverage the cached data.

I didn't fully understand the logic behind it, and I couldn't find similar suggestions in other articles.

What is the best practice when using caching?

5 REPLIES

-werners-
Esteemed Contributor III

The best practice when using caching is to make sure the cache is actually used. That is what David explains in the article you referenced.
By assigning the cached data to a new DataFrame, you can easily view the analyzed plan, which is what Spark uses to read from the cache.
If you do not assign the cached data to a new DataFrame, this is harder to see, as the example in the article shows, and you might think the cache is used when it is not.
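
One way to check the point above is to look at the plan Spark prints for a query and see whether it scans the in-memory cache. The sketch below is a minimal, duck-typed helper and not part of any Spark API: `plan_text` and `reads_from_cache` are hypothetical names, and it assumes that `df.explain()` prints through Python's standard output (as PySpark's DataFrame does) and that a cache read appears as `InMemoryTableScan` in the physical plan.

```python
import io
from contextlib import redirect_stdout


def plan_text(df) -> str:
    """Capture whatever df.explain() prints and return it as a string.

    Assumption: explain() writes via Python's print(), so redirecting
    sys.stdout is enough to capture it.
    """
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        df.explain()
    return buffer.getvalue()


def reads_from_cache(df) -> bool:
    """Return True if the printed plan contains an in-memory cache scan.

    Assumption: Spark labels a read from cached data as
    "InMemoryTableScan" in the physical plan.
    """
    return "InMemoryTableScan" in plan_text(df)


# Usage with a real SparkSession (assumed to be available as `spark`):
#   cached_df = spark.range(10).cache()
#   cached_df.count()                      # action that fills the cache
#   reads_from_cache(cached_df.filter("id > 5"))
```

Checking the plan of the downstream query, rather than of the cached DataFrame itself, is the useful part: it shows whether that specific query is served from the cache.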

 

Mado
Valued Contributor II

@-werners- 

Thanks for your help. 

If I assign the cached DataFrame to a new variable, which one should I use in the following cells for the next transformations? For example:

  • cached_df = df.cache()
  • cached_df.count()

Then, which of the following options is correct for the next transformations?

df2 = (df
  .join(...)
  .select(...)
  .filter(...))

or:

df2 = (cached_df
  .join(...)
  .select(...)
  .filter(...))
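
For context on the two options: as far as the PySpark API goes, `cache()` marks the DataFrame for caching and returns the DataFrame it was called on, so `cached_df` and `df` are expected to refer to the same object, and the separate name mainly documents intent. A small sketch under that assumption (the helper name `cache_and_materialize` is hypothetical, and the behaviour is worth verifying on your own Spark version):

```python
def cache_and_materialize(df):
    """Mark df for caching, run one action to fill the cache, return the handle.

    Works with any object exposing cache() and count(), such as a
    PySpark DataFrame.
    """
    cached_df = df.cache()  # lazy: nothing is stored yet
    cached_df.count()       # action: triggers the computation that fills the cache
    return cached_df


# Usage with a real DataFrame `df` (assumed to exist):
#   cached_df = cache_and_materialize(df)
#   cached_df is df   # identity check for the assumption stated above
```

Using `cached_df` in later cells keeps it obvious to the reader which transformations are meant to build on cached data.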

 

Vinay_M_R
Databricks Employee

Hi  @Mado 

Whenever you want to display a DataFrame, you can call cache() on it first; this ensures that this particular DataFrame is cached.

Also, as you mentioned, once you have cached the new DataFrame df_new, you can unpersist the earlier df.
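
The cache-then-unpersist pattern described here can be wrapped so the release is never forgotten. This is a minimal sketch, assuming only that the object has the `cache()` and `unpersist()` methods that PySpark DataFrames provide; the context manager name `cached` is hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def cached(df):
    """Cache df for the duration of a with-block, then release it.

    cache() is lazy, so the data is stored on the first action
    that runs inside the block.
    """
    cached_df = df.cache()
    try:
        yield cached_df
    finally:
        # Runs even if the block raises, so cached data is always released.
        cached_df.unpersist()


# Usage with real DataFrames `df` and `other` (assumed to exist):
#   with cached(df) as cached_df:
#       n = cached_df.count()
#       joined = cached_df.join(other, "id").collect()
```

Scoping the cache to a block fits cases where the DataFrame is reused several times in one place and not needed afterwards.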

Lakshay
Databricks Employee

In addition to the other comments, I will just add: make sure you cache only when necessary. If you need to keep a DataFrame around for a while because it is referenced again later in the code, then you should consider caching it. But if the DataFrame is used only once in your code, there is no need to cache it.

Unnecessary caching can add to the problem rather than help.

-werners-
Esteemed Contributor III

Totally agree on this.
