Hi,
When caching a DataFrame, I always use "df.cache().count()".
However, in this reference, it is suggested to save the cached DataFrame into a new variable:
- When you cache a DataFrame create a new variable for it cachedDF = df.cache(). This will allow you to bypass the problems that we were solving in our example, that sometimes it is not clear what is the analyzed plan and what was actually cached. Here whenever you call cachedDF.select(…) it will leverage the cached data.
I don't fully understand the logic behind this, and I couldn't find similar suggestions in other articles.
My question is: what is the best practice when caching a DataFrame?