07-18-2023 10:41 PM
Hi,
I have a DataFrame to which several transformations are applied, and I want to display the DataFrame after some of these transformations to check the intermediate results.
However, every time I try to display results, Spark runs the execution plan again. The reference I found suggests saving the DataFrame and then loading it back, but that solution cannot be applied on the platform I am working on.
Is there any other solution to display results a few times in a notebook without re-executing the logic?
Can I use .cache() for this purpose, as in the sketch below?
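Here the table name and transformations are just placeholders for my actual logic:

```python
# Cache the DataFrame so that later displays reuse the cached result
# instead of re-running the whole execution plan.
df = spark.table("my_table")        # placeholder source table
df = df.filter(df.value > 0)        # placeholder transformation
df = df.cache()                     # mark df for caching (lazy)
display(df)                         # first action materializes the cache
display(df)                         # served from the cache, no recompute
```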
And since the name of the DataFrame changes in later lines, I repeat the caching for each new variable, like below:
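Again with placeholder transformations:

```python
# The next stage gets a new name, so I cache the new variable as well.
df_new = df.groupBy("category").count()   # placeholder transformation
df_new = df_new.cache()
display(df_new)
```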
07-18-2023 11:06 PM - edited 07-18-2023 11:44 PM
Yes, df.cache() will work.
07-19-2023 12:12 AM
Thanks.
In this reference, it is suggested to save the cached DataFrame into a new variable:
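As I read it, the suggestion amounts to this (a minimal sketch; the variable names are mine):

```python
# Keep the result of .cache() in a new variable and display that.
df_cached = df.cache()
display(df_cached)
```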
I didn't quite understand the logic behind it.
Do you think it is necessary to save the DataFrame into a new variable if I want to use caching to display the DataFrame?
07-19-2023 12:19 AM
Hi @Mado
Yes, it is necessary to save the DataFrame into a new variable if you want to use caching to display the DataFrame. This is because caching the DataFrame can cause it to lose any data skipping that can come from additional filters added on top of the cached DataFrame, and the data that gets cached might not be updated if the table is accessed using a different identifier. Therefore, it is recommended to assign the results of Spark transformations back to a SparkDataFrame variable, similar to how you might use common table expressions (CTEs), temporary views, or DataFrames in other systems.
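In other words, a pattern like this minimal sketch (the table name and filters are placeholders):

```python
# Assign each transformation result back to a DataFrame variable and
# cache the result you intend to display or reuse several times.
df = spark.table("events")              # placeholder table
df = df.filter(df.status == "ok")       # placeholder filter
df = df.cache()
display(df)

# Caveat: additional filters on top of the cached DataFrame scan the
# cached data, so they no longer benefit from data skipping on the
# underlying table files.
df_recent = df.filter(df.year == 2023)  # placeholder follow-up filter
display(df_recent)
```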
07-19-2023 04:37 AM - edited 07-19-2023 04:48 AM
Thanks for your help.
I am afraid I didn't understand why it is necessary.
You wrote: "caching the DataFrame can cause it to lose any data skipping that can come from additional filters added on top of the cached DataFrame".
Note that in my notebook, "df" is cached and displayed immediately. Then a few more transformations are applied on "df" and the results are saved in "df_new", which is cached for display purposes.
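Roughly like this (the transformations are placeholders for my actual logic):

```python
df = df.cache()
display(df)                             # df is displayed immediately after caching

df_new = df.filter(df.amount > 100)     # a few more transformations on df
df_new = df_new.cache()                 # cached again, for display purposes
display(df_new)
```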
You also wrote: "the data that gets cached might not be updated if the table is accessed using a different identifier".
Sorry, I didn't understand what "accessed using a different identifier" means here.
Finally, on the recommendation "to assign the results of Spark transformations back to a SparkDataFrame variable, similar to how you might use common table expressions (CTEs), temporary views, or DataFrames in other systems":
That is already done in the notebook. We assign the result of each transformation to a new DataFrame, whether or not caching is used.
Is there a reference in the Databricks documentation on this?