<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: What is the best approach to display DataFrame without re-executing the logic each time we displ in Get Started Discussions</title>
    <link>https://community.databricks.com/t5/get-started-discussions/what-is-the-best-approach-to-display-dataframe-without-re/m-p/37929#M535</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/65776"&gt;@Mado&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Yes, it is necessary to save the DataFrame into a new variable if you want to use caching to display the DataFrame. This is because caching the DataFrame can cause it to lose any data skipping that can come from additional filters added on top of the cached DataFrame, and the data that gets cached might not be updated if the table is accessed using a different identifier. Therefore, it is recommended to assign the results of Spark transformations back to a SparkDataFrame variable, similar to how you might use common table expressions (CTEs), temporary views, or DataFrames in other systems.&lt;/SPAN&gt;&lt;/P&gt;</description>
    <pubDate>Wed, 19 Jul 2023 07:19:01 GMT</pubDate>
    <dc:creator>Vinay_M_R</dc:creator>
    <dc:date>2023-07-19T07:19:01Z</dc:date>
    <item>
      <title>What is the best approach to display DataFrame without re-executing the logic each time we display?</title>
      <link>https://community.databricks.com/t5/get-started-discussions/what-is-the-best-approach-to-display-dataframe-without-re/m-p/37916#M530</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I have a DataFrame to which several transformations are applied. I want to display the DataFrame after some of these transformations to check the results.&lt;/P&gt;&lt;P&gt;However, according to this &lt;A href="https://www.databricks.com/blog/2022/03/10/top-5-databricks-performance-tips.html" target="_self"&gt;reference&lt;/A&gt;, &lt;SPAN&gt;every time I display the results, the execution plan runs again. The reference proposes saving the DataFrame and then loading it back, but that solution cannot be applied on the platform I am working on.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Is there another way to display results several times in a notebook without re-executing the logic?&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Can I use .cache() for this purpose, as below?&lt;/SPAN&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;SPAN&gt;df.cache().count()&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;SPAN&gt;df.display()&lt;/SPAN&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;SPAN&gt;Since the name of the DataFrame changes in the following lines, I repeat the same pattern for each new DataFrame:&lt;/SPAN&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;SPAN&gt;df_new.cache().count()&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;SPAN&gt;df_new.display()&lt;/SPAN&gt;&lt;/LI&gt;&lt;/UL&gt;</description>
      <pubDate>Wed, 19 Jul 2023 05:41:22 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/what-is-the-best-approach-to-display-dataframe-without-re/m-p/37916#M530</guid>
      <dc:creator>Mado</dc:creator>
      <dc:date>2023-07-19T05:41:22Z</dc:date>
    </item>
    <item>
      <title>Re: What is the best approach to display DataFrame without re-executing the logic each time we displ</title>
      <link>https://community.databricks.com/t5/get-started-discussions/what-is-the-best-approach-to-display-dataframe-without-re/m-p/37920#M531</link>
      <description>&lt;P&gt;Yes, df.cache() will work.&lt;/P&gt;</description>
      <pubDate>Wed, 19 Jul 2023 06:44:00 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/what-is-the-best-approach-to-display-dataframe-without-re/m-p/37920#M531</guid>
      <dc:creator>dream</dc:creator>
      <dc:date>2023-07-19T06:44:00Z</dc:date>
    </item>
    <item>
      <title>Re: What is the best approach to display DataFrame without re-executing the logic each time we displ</title>
      <link>https://community.databricks.com/t5/get-started-discussions/what-is-the-best-approach-to-display-dataframe-without-re/m-p/37926#M533</link>
      <description>&lt;P&gt;Thanks.&lt;/P&gt;&lt;P&gt;This &lt;A href="https://towardsdatascience.com/best-practices-for-caching-in-spark-sql-b22fb0f02d34" target="_self"&gt;reference&lt;/A&gt; suggests saving the cached DataFrame into a new variable:&lt;/P&gt;&lt;UL class=""&gt;&lt;LI&gt;When you cache a DataFrame create a new variable for it&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;EM&gt;cachedDF = df.cache()&lt;/EM&gt;. This will allow you to bypass the problems that we were solving in our example, that sometimes it is not clear what is the analyzed plan and what was actually cached. Here whenever you call&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;EM&gt;cachedDF.select(…)&lt;/EM&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;it will leverage the cached data.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;I didn't fully understand the logic behind it.&lt;/P&gt;&lt;P&gt;Do you think it is necessary to save the DataFrame into a new variable when I want to use caching to display the DataFrame?&lt;/P&gt;</description>
      <pubDate>Wed, 19 Jul 2023 07:12:56 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/what-is-the-best-approach-to-display-dataframe-without-re/m-p/37926#M533</guid>
      <dc:creator>Mado</dc:creator>
      <dc:date>2023-07-19T07:12:56Z</dc:date>
    </item>
    <item>
      <title>Re: What is the best approach to display DataFrame without re-executing the logic each time we displ</title>
      <link>https://community.databricks.com/t5/get-started-discussions/what-is-the-best-approach-to-display-dataframe-without-re/m-p/37929#M535</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/65776"&gt;@Mado&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Yes, it is necessary to save the DataFrame into a new variable if you want to use caching to display the DataFrame. This is because caching the DataFrame can cause it to lose any data skipping that can come from additional filters added on top of the cached DataFrame, and the data that gets cached might not be updated if the table is accessed using a different identifier. Therefore, it is recommended to assign the results of Spark transformations back to a SparkDataFrame variable, similar to how you might use common table expressions (CTEs), temporary views, or DataFrames in other systems.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 19 Jul 2023 07:19:01 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/what-is-the-best-approach-to-display-dataframe-without-re/m-p/37929#M535</guid>
      <dc:creator>Vinay_M_R</dc:creator>
      <dc:date>2023-07-19T07:19:01Z</dc:date>
    </item>
    <item>
      <title>Re: What is the best approach to display DataFrame without re-executing the logic each time we displ</title>
      <link>https://community.databricks.com/t5/get-started-discussions/what-is-the-best-approach-to-display-dataframe-without-re/m-p/37952#M538</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/76894"&gt;@Vinay_M_R&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks for your help.&lt;/P&gt;&lt;P&gt;I'm afraid I still don't understand why it is necessary.&lt;/P&gt;&lt;BLOCKQUOTE&gt;&lt;P&gt;&lt;SPAN&gt;This is because caching the DataFrame can cause it to lose any data skipping that can come from additional filters added on top of the cached DataFrame,&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;Note that when df is cached, it is displayed immediately:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;SPAN&gt;df.cache().count()&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;SPAN&gt;df.display()&lt;/SPAN&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Then a few more transformations are applied to "df" and the results are saved in "df_new", which is also cached for display purposes:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;SPAN&gt;df_new.cache().count()&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;SPAN&gt;df_new.display()&lt;/SPAN&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;BLOCKQUOTE&gt;&lt;P&gt;&lt;SPAN&gt;and the data that gets cached might not be updated if the table is accessed using a different identifier.&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;Sorry, I didn't understand the part "&lt;SPAN&gt;if the table is accessed using a different identifier&lt;/SPAN&gt;".&lt;/P&gt;&lt;BLOCKQUOTE&gt;&lt;P&gt;&lt;SPAN&gt;Therefore, it is recommended to assign the results of Spark transformations back to a SparkDataFrame variable, similar to how you might use common table expressions (CTEs), temporary views, or DataFrames in other systems.&lt;/SPAN&gt;&lt;/P&gt;&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;This is already done in the notebook: we assign the result of each transformation to a new DataFrame whether or not caching is used.&lt;/P&gt;&lt;P&gt;Is there a reference for this in the Databricks documentation?&lt;/P&gt;</description>
      <pubDate>Wed, 19 Jul 2023 11:48:17 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/what-is-the-best-approach-to-display-dataframe-without-re/m-p/37952#M538</guid>
      <dc:creator>Mado</dc:creator>
      <dc:date>2023-07-19T11:48:17Z</dc:date>
    </item>
  </channel>
</rss>

