<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Do I have to run .cache() on my dataframe before returning aggregations like count? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/do-i-have-to-run-cache-on-my-dataframe-before-returning/m-p/23596#M16318</link>
    <description>&lt;P&gt;It is better to use cache() when a DataFrame is used multiple times in a single pipeline.&lt;/P&gt;&lt;P&gt;Through the cache() and persist() methods, Spark provides an optimization mechanism to store the intermediate computation of a DataFrame so it can be reused in subsequent actions.&lt;/P&gt;</description>
    <pubDate>Thu, 17 Jun 2021 14:58:40 GMT</pubDate>
    <dc:creator>Srikanth_Gupta_</dc:creator>
    <dc:date>2021-06-17T14:58:40Z</dc:date>
    <item>
      <title>Do I have to run .cache() on my dataframe before returning aggregations like count?</title>
      <link>https://community.databricks.com/t5/data-engineering/do-i-have-to-run-cache-on-my-dataframe-before-returning/m-p/23595#M16317</link>
      <description />
      <pubDate>Wed, 16 Jun 2021 23:08:38 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/do-i-have-to-run-cache-on-my-dataframe-before-returning/m-p/23595#M16317</guid>
      <dc:creator>User16826992666</dc:creator>
      <dc:date>2021-06-16T23:08:38Z</dc:date>
    </item>
    <item>
      <title>Re: Do I have to run .cache() on my dataframe before returning aggregations like count?</title>
      <link>https://community.databricks.com/t5/data-engineering/do-i-have-to-run-cache-on-my-dataframe-before-returning/m-p/23596#M16318</link>
      <description>&lt;P&gt;It is better to use cache() when a DataFrame is used multiple times in a single pipeline.&lt;/P&gt;&lt;P&gt;Through the cache() and persist() methods, Spark provides an optimization mechanism to store the intermediate computation of a DataFrame so it can be reused in subsequent actions.&lt;/P&gt;</description>
      <pubDate>Thu, 17 Jun 2021 14:58:40 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/do-i-have-to-run-cache-on-my-dataframe-before-returning/m-p/23596#M16318</guid>
      <dc:creator>Srikanth_Gupta_</dc:creator>
      <dc:date>2021-06-17T14:58:40Z</dc:date>
    </item>
    <item>
      <title>Re: Do I have to run .cache() on my dataframe before returning aggregations like count?</title>
      <link>https://community.databricks.com/t5/data-engineering/do-i-have-to-run-cache-on-my-dataframe-before-returning/m-p/23597#M16319</link>
      <description>&lt;P&gt;You do not have to cache anything to make it work. You would decide that based on whether you want to spend memory/storage to avoid recomputing the DataFrame, such as when you will use it in multiple operations afterwards.&lt;/P&gt;</description>
      <pubDate>Thu, 17 Jun 2021 18:24:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/do-i-have-to-run-cache-on-my-dataframe-before-returning/m-p/23597#M16319</guid>
      <dc:creator>sean_owen</dc:creator>
      <dc:date>2021-06-17T18:24:29Z</dc:date>
    </item>
  </channel>
</rss>

