Get Started Discussions
Start your journey with Databricks by joining discussions on getting started guides, tutorials, and introductory topics. Connect with beginners and experts alike to kickstart your Databricks experience.

What is the best approach to display a DataFrame without re-executing the logic each time it is displayed?

Mado
Valued Contributor II

Hi,

I have a DataFrame to which several transformations are applied. I want to display the DataFrame after some of these transformations to check the results.

 

However, according to the Reference, the execution plan runs again every time I display the results. The reference proposes saving the DataFrame and then loading it back, but this solution cannot be applied on the platform I am working on.

 

Is there any other solution to display results a few times in a notebook without re-executing the logic?

 

Can I use .cache() for this purpose, as below?

  • df.cache().count()
  • df.display()

Since the name of the DataFrame changes in later lines, I repeat it as below:

  • df_new.cache().count()
  • df_new.display()
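The pattern in the two lists above can be sketched as a small helper. This is a minimal sketch, not a definitive implementation: the name materialize is hypothetical, and the usage lines assume a Databricks notebook where spark and display are predefined. Because cache() on its own is lazy, the count() action is what actually runs the plan once and fills the cache.

```python
def materialize(df):
    """Mark df for caching and run one action so the cache is populated.

    `materialize` is a hypothetical helper name, not a Spark API.
    """
    cached = df.cache()  # lazy: only marks the plan for caching
    cached.count()       # action: executes the plan once and fills the cache
    return cached


# Usage sketch (assumes a Databricks notebook with `spark` and `display`):
# df = materialize(spark.range(1000).withColumnRenamed("id", "n"))
# display(df)                                   # can read from the cache
# df_new = materialize(df.filter("n % 2 = 0"))
# display(df_new)
# df.unpersist()   # release the earlier cache once it is no longer needed
```

Each cached DataFrame occupies executor memory (and possibly disk), so calling unpersist() on intermediate results that are no longer displayed is usually worth considering.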

 

 

4 REPLIES

dream
Contributor

Yes, df.cache() will work.

Mado
Valued Contributor II

Thanks.

In this reference, it is suggested to save the cached DataFrame into a new variable:

  • When you cache a DataFrame create a new variable for it cachedDF = df.cache(). This will allow you to bypass the problems that we were solving in our example, that sometimes it is not clear what is the analyzed plan and what was actually cached. Here whenever you call cachedDF.select(...) it will leverage the cached data.

I didn't fully understand the logic behind it.

Do you think it is necessary to save the DataFrame into a new variable when I only want to use caching to display the DataFrame?
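One way to check whether a given DataFrame actually reads from the cache is to inspect its physical plan. The sketch below is an illustration under stated assumptions: the helper name reads_from_cache is hypothetical, and it assumes that a plan served from cached data contains an InMemoryTableScan node, as open-source Spark prints it; node names may differ on other runtimes.

```python
import contextlib
import io


def reads_from_cache(df):
    """Return True if df's printed physical plan mentions a cached-data scan.

    `reads_from_cache` is a hypothetical helper name, not a Spark API.
    """
    buf = io.StringIO()
    # DataFrame.explain() prints the plan to stdout, so capture the output.
    with contextlib.redirect_stdout(buf):
        df.explain()
    # Assumption: cached data shows up as an InMemoryTableScan node.
    return "InMemoryTableScan" in buf.getvalue()


# Usage sketch (assumes an existing DataFrame `df` with a column "n"):
# cached_df = df.cache()
# cached_df.count()                          # populate the cache
# reads_from_cache(cached_df.select("n"))    # check a derived DataFrame
```

Running the check on DataFrames derived from the cached one makes it visible which queries reuse the cached data, which is the ambiguity the quoted advice refers to.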

 

Vinay_M_R
Databricks Employee (accepted solution)

Hi @Mado 

Yes, it is necessary to save the DataFrame into a new variable if you want to use caching to display the DataFrame. This is because caching the DataFrame can cause it to lose any data skipping that can come from additional filters added on top of the cached DataFrame, and the data that gets cached might not be updated if the table is accessed using a different identifier. Therefore, it is recommended to assign the results of Spark transformations back to a Spark DataFrame variable, similar to how you might use common table expressions (CTEs), temporary views, or DataFrames in other systems.

Mado
Valued Contributor II

@Vinay_M_R 

Thanks for your help.

I'm afraid I didn't understand why it is necessary.

"This is because caching the DataFrame can cause it to lose any data skipping that can come from additional filters added on top of the cached DataFrame,"


Note that when df is cached, it is displayed immediately. 

  • df.cache().count()
  • df.display()

Then, a few more transformations are applied to "df" and the results are saved in "df_new", which is cached for display purposes:

  • df_new.cache().count()
  • df_new.display()

 

"and the data that gets cached might not be updated if the table is accessed using a different identifier."


Sorry, I didn't understand the part about "if the table is accessed using a different identifier".

"Therefore, it is recommended to assign the results of Spark transformations back to a Spark DataFrame variable, similar to how you might use common table expressions (CTEs), temporary views, or DataFrames in other systems."


This is already done in the notebook: we assign the result of each transformation to a new DataFrame whether or not caching is used.

Is there a reference in the Databricks documentation on this?
