07-18-2023 10:41 PM
Hi,
I have a DataFrame to which several transformations are applied, and I want to display the DataFrame after some of these transformations to check the intermediate results.
However, every time I try to display results, Spark runs the execution plan again. The reference I found suggests saving the DataFrame and then loading it back, but that solution cannot be applied on the platform I am working on.
Is there any other solution to display results a few times in a notebook without re-executing the logic?
Can I use .cache() for this purpose, as in the sketch below?
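Here the table name and transformations are just placeholders for my actual logic:

```python
# Cache the DataFrame so that later displays reuse the cached result
# instead of re-running the whole execution plan.
df = spark.table("my_table")        # placeholder source table
df = df.filter(df.value > 0)        # placeholder transformation
df = df.cache()                     # mark df for caching (lazy)
display(df)                         # first action materializes the cache
display(df)                         # served from the cache, no recompute
```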
And since the name of the DataFrame changes in later lines, I repeat the caching for each new variable, like below:
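Again with placeholder transformations:

```python
# The next stage gets a new name, so I cache the new variable as well.
df_new = df.groupBy("category").count()   # placeholder transformation
df_new = df_new.cache()
display(df_new)
```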
07-18-2023 11:06 PM - edited 07-18-2023 11:44 PM
Yes, df.cache() will work.
07-19-2023 12:12 AM
Thanks.
In this reference, it is suggested to save the cached DataFrame into a new variable:
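As I read it, the suggestion amounts to this (a minimal sketch; the variable names are mine):

```python
# Keep the result of .cache() in a new variable and display that.
df_cached = df.cache()
display(df_cached)
```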
I didn't quite understand the logic behind it.
Do you think it is necessary to save the DataFrame into a new variable if I want to use caching to display the DataFrame?
07-19-2023 12:19 AM
Hi @Mado
Yes, it is necessary to save the DataFrame into a new variable if you want to use caching to display the DataFrame. This is because caching the DataFrame can cause it to lose any data skipping that can come from additional filters added on top of the cached DataFrame, and the data that gets cached might not be updated if the table is accessed using a different identifier. Therefore, it is recommended to assign the results of Spark transformations back to a SparkDataFrame variable, similar to how you might use common table expressions (CTEs), temporary views, or DataFrames in other systems.
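In other words, a pattern like this minimal sketch (the table name and filters are placeholders):

```python
# Assign each transformation result back to a DataFrame variable and
# cache the result you intend to display or reuse several times.
df = spark.table("events")              # placeholder table
df = df.filter(df.status == "ok")       # placeholder filter
df = df.cache()
display(df)

# Caveat: additional filters on top of the cached DataFrame scan the
# cached data, so they no longer benefit from data skipping on the
# underlying table files.
df_recent = df.filter(df.year == 2023)  # placeholder follow-up filter
display(df_recent)
```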
07-19-2023 04:37 AM - edited 07-19-2023 04:48 AM
Thanks for your help.
I am afraid I didn't understand why it is necessary.
You wrote: "caching the DataFrame can cause it to lose any data skipping that can come from additional filters added on top of the cached DataFrame".
Note that in my notebook, "df" is cached and displayed immediately. Then a few more transformations are applied on "df" and the results are saved in "df_new", which is cached for display purposes.
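Roughly like this (the transformations are placeholders for my actual logic):

```python
df = df.cache()
display(df)                             # df is displayed immediately after caching

df_new = df.filter(df.amount > 100)     # a few more transformations on df
df_new = df_new.cache()                 # cached again, for display purposes
display(df_new)
```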
You also wrote: "the data that gets cached might not be updated if the table is accessed using a different identifier".
Sorry, I didn't understand what "accessed using a different identifier" means here.
Finally, on the recommendation "to assign the results of Spark transformations back to a SparkDataFrame variable, similar to how you might use common table expressions (CTEs), temporary views, or DataFrames in other systems":
That is already done in the notebook. We assign the result of each transformation to a new DataFrame, whether or not caching is used.
Is there a reference in the Databricks documentation on this?