<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Should I always cache my RDD's and DataFrames? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/should-i-always-cache-my-rdd-s-and-dataframes/m-p/30766#M22333</link>
    <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Note that &lt;PRE&gt;&lt;CODE&gt;cache()&lt;/CODE&gt;&lt;/PRE&gt; is now an alias for &lt;PRE&gt;&lt;CODE&gt;persist(StorageLevel.MEMORY_AND_DISK)&lt;/CODE&gt;&lt;/PRE&gt; according to the &lt;A target="_blank" href="https://"&gt;docs&lt;/A&gt;.&lt;/P&gt; 
&lt;P&gt;&lt;/P&gt;</description>
    <pubDate>Tue, 14 May 2019 13:26:21 GMT</pubDate>
    <dc:creator>MichaelFryar_</dc:creator>
    <dc:date>2019-05-14T13:26:21Z</dc:date>
    <item>
      <title>Should I always cache my RDD's and DataFrames?</title>
      <link>https://community.databricks.com/t5/data-engineering/should-i-always-cache-my-rdd-s-and-dataframes/m-p/30763#M22330</link>
      <description />
      <pubDate>Tue, 24 Feb 2015 23:40:15 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/should-i-always-cache-my-rdd-s-and-dataframes/m-p/30763#M22330</guid>
      <dc:creator>cfregly</dc:creator>
      <dc:date>2015-02-24T23:40:15Z</dc:date>
    </item>
    <item>
      <title>Re: Should I always cache my RDD's and DataFrames?</title>
      <link>https://community.databricks.com/t5/data-engineering/should-i-always-cache-my-rdd-s-and-dataframes/m-p/30764#M22331</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;P&gt;You should definitely &lt;PRE&gt;&lt;CODE&gt;cache()&lt;/CODE&gt;&lt;/PRE&gt; RDDs and DataFrames in the following cases:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Reusing them in an iterative loop (e.g., ML algorithms).&lt;/LI&gt;&lt;LI&gt;Reusing an RDD multiple times in a single application, job, or notebook.&lt;/LI&gt;&lt;LI&gt;When the upfront cost of regenerating the RDD partitions is high (e.g., reading from HDFS, or after a complex chain of &lt;PRE&gt;&lt;CODE&gt;map()&lt;/CODE&gt;&lt;/PRE&gt;, &lt;PRE&gt;&lt;CODE&gt;filter()&lt;/CODE&gt;&lt;/PRE&gt;, etc.). This also speeds up recovery if a Worker node dies.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Keep in mind that Spark automatically evicts RDD partitions from Workers in an LRU manner. The LRU eviction happens independently on each Worker and depends on the memory available to that Worker.&lt;/P&gt;&lt;P&gt;During the lifecycle of an RDD, its partitions may exist in memory or on disk across the cluster depending on available memory.&lt;/P&gt;&lt;P&gt;The Storage tab of the Spark UI shows where partitions reside (memory or disk) across the cluster at any given point in time.&lt;/P&gt;&lt;P&gt;Note that &lt;PRE&gt;&lt;CODE&gt;cache()&lt;/CODE&gt;&lt;/PRE&gt; is an alias for &lt;PRE&gt;&lt;CODE&gt;persist(StorageLevel.MEMORY_ONLY)&lt;/CODE&gt;&lt;/PRE&gt;, which may not be ideal for datasets larger than available cluster memory. Each RDD partition evicted from memory must be rebuilt from source (e.g., HDFS, network), which is expensive.&lt;/P&gt;&lt;P&gt;A better option is &lt;PRE&gt;&lt;CODE&gt;persist(StorageLevel.MEMORY_AND_DISK)&lt;/CODE&gt;&lt;/PRE&gt;, which spills RDD partitions to the Worker's local disk when they are evicted from memory. In that case, rebuilding a partition only requires reading from the Worker's local disk, which is relatively fast.&lt;/P&gt;&lt;P&gt;You can also persist the data as serialized byte arrays by using the &lt;PRE&gt;&lt;CODE&gt;_SER&lt;/CODE&gt;&lt;/PRE&gt; variants: &lt;PRE&gt;&lt;CODE&gt;MEMORY_ONLY_SER&lt;/CODE&gt;&lt;/PRE&gt; and &lt;PRE&gt;&lt;CODE&gt;MEMORY_AND_DISK_SER&lt;/CODE&gt;&lt;/PRE&gt;. This can save space but incurs an extra serialization/deserialization penalty. Because the data is stored as serialized byte arrays, fewer Java objects are created, which reduces GC pressure.&lt;/P&gt;&lt;P&gt;You can additionally replicate the data to another node by appending &lt;PRE&gt;&lt;CODE&gt;_2&lt;/CODE&gt;&lt;/PRE&gt; to the StorageLevel (serialized or not), e.g., &lt;PRE&gt;&lt;CODE&gt;MEMORY_ONLY_SER_2&lt;/CODE&gt;&lt;/PRE&gt; and &lt;PRE&gt;&lt;CODE&gt;MEMORY_AND_DISK_2&lt;/CODE&gt;&lt;/PRE&gt;. This enables fast partition recovery in the case of a node failure, since the data can be restored from a rack-local, neighboring node through the same network switch, for example.&lt;/P&gt;&lt;P&gt;You can see the full list here: &lt;A href="https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.storage.StorageLevel$" target="_blank"&gt;https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.storage.StorageLevel$&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 24 Feb 2015 23:40:55 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/should-i-always-cache-my-rdd-s-and-dataframes/m-p/30764#M22331</guid>
      <dc:creator>cfregly</dc:creator>
      <dc:date>2015-02-24T23:40:55Z</dc:date>
    </item>
    <item>
      <title>Re: Should I always cache my RDD's and DataFrames?</title>
      <link>https://community.databricks.com/t5/data-engineering/should-i-always-cache-my-rdd-s-and-dataframes/m-p/30765#M22332</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Hello,&lt;/P&gt;
&lt;P&gt;Which is more efficient to cache, an RDD or a DataFrame? (That is, which consumes less memory?)&lt;/P&gt;
&lt;P&gt;Thank you,&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 11 Apr 2017 09:24:38 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/should-i-always-cache-my-rdd-s-and-dataframes/m-p/30765#M22332</guid>
      <dc:creator>ThomasDecaux</dc:creator>
      <dc:date>2017-04-11T09:24:38Z</dc:date>
    </item>
    <item>
      <title>Re: Should I always cache my RDD's and DataFrames?</title>
      <link>https://community.databricks.com/t5/data-engineering/should-i-always-cache-my-rdd-s-and-dataframes/m-p/30766#M22333</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Note that &lt;PRE&gt;&lt;CODE&gt;cache()&lt;/CODE&gt;&lt;/PRE&gt; is now an alias for &lt;PRE&gt;&lt;CODE&gt;persist(StorageLevel.MEMORY_AND_DISK)&lt;/CODE&gt;&lt;/PRE&gt; according to the &lt;A target="_blank" href="https://"&gt;docs&lt;/A&gt;.&lt;/P&gt; 
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 14 May 2019 13:26:21 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/should-i-always-cache-my-rdd-s-and-dataframes/m-p/30766#M22333</guid>
      <dc:creator>MichaelFryar_</dc:creator>
      <dc:date>2019-05-14T13:26:21Z</dc:date>
    </item>
    <item>
      <title>Re: Should I always cache my RDD's and DataFrames?</title>
      <link>https://community.databricks.com/t5/data-engineering/should-i-always-cache-my-rdd-s-and-dataframes/m-p/30767#M22334</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Thanks for this clarification on the deserialization penalty. I always wanted to know when this penalty is imposed.&lt;/P&gt;
&lt;P&gt;&lt;A target="_blank" href="https://"&gt;https://domyhomeworkonline.net/do-my-summer-homework.php&lt;/A&gt;&lt;/P&gt; 
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 04 Jul 2019 10:31:05 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/should-i-always-cache-my-rdd-s-and-dataframes/m-p/30767#M22334</guid>
      <dc:creator>ScottKinman</dc:creator>
      <dc:date>2019-07-04T10:31:05Z</dc:date>
    </item>
    <item>
      <title>Re: Should I always cache my RDD's and DataFrames?</title>
      <link>https://community.databricks.com/t5/data-engineering/should-i-always-cache-my-rdd-s-and-dataframes/m-p/30768#M22335</link>
      <description>&lt;P&gt;Hello Mefryar,&lt;/P&gt;&lt;P&gt;I still see that &lt;PRE&gt;&lt;CODE&gt;cache()&lt;/CODE&gt;&lt;/PRE&gt; is an alias for &lt;PRE&gt;&lt;CODE&gt;persist(StorageLevel.MEMORY_ONLY)&lt;/CODE&gt;&lt;/PRE&gt;. Doc links attached.&lt;/P&gt;&lt;P&gt;Official doc&lt;/P&gt;&lt;P&gt;Official PySpark doc&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Sat, 18 Jan 2020 15:21:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/should-i-always-cache-my-rdd-s-and-dataframes/m-p/30768#M22335</guid>
      <dc:creator>SivagangireddyS</dc:creator>
      <dc:date>2020-01-18T15:21:32Z</dc:date>
    </item>
    <item>
      <title>Re: Should I always cache my RDD's and DataFrames?</title>
      <link>https://community.databricks.com/t5/data-engineering/should-i-always-cache-my-rdd-s-and-dataframes/m-p/30769#M22336</link>
      <description>&lt;P&gt;Hi, @Sivagangireddy Singam. I see that the RDD programming guide does say that the default storage level is &lt;CODE&gt;MEMORY_ONLY&lt;/CODE&gt;, but the &lt;I&gt;latest&lt;/I&gt; PySpark docs (2.4.4) &lt;A href="https://" alt="https://" target="_blank"&gt;state&lt;/A&gt; "The default storage level has changed to &lt;CODE&gt;MEMORY_AND_DISK&lt;/CODE&gt;." (The PySpark docs you linked to were 2.1.2.)&lt;/P&gt;</description>
      <pubDate>Tue, 21 Jan 2020 17:31:02 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/should-i-always-cache-my-rdd-s-and-dataframes/m-p/30769#M22336</guid>
      <dc:creator>MichaelFryar_</dc:creator>
      <dc:date>2020-01-21T17:31:02Z</dc:date>
    </item>
  </channel>
</rss>

