<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Need Suggestion for better caching strategy in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/need-suggestion-for-better-caching-strategy/m-p/58092#M31014</link>
    <description>&lt;P&gt;I have the following steps to perform:&lt;/P&gt;&lt;P&gt;1. Read a CSV file (a very large file, ~100 GB).&lt;/P&gt;&lt;P&gt;2. Add an index using the zipWithIndex function.&lt;/P&gt;&lt;P&gt;3. Repartition the DataFrame.&lt;/P&gt;&lt;P&gt;4. Pass it on to another function.&lt;/P&gt;&lt;P&gt;Can you suggest the best caching strategy to make these steps run faster?&lt;/P&gt;&lt;P&gt;Below is my cluster configuration:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="vishwanath_1_0-1705915220664.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/5909i7998E0612A4B00E6/image-size/medium/is-moderation-mode/true?v=v2&amp;amp;px=400" role="button" title="vishwanath_1_0-1705915220664.png" alt="vishwanath_1_0-1705915220664.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;A few more questions:&lt;/P&gt;&lt;P&gt;1. Would a single worker suffice for this operation?&lt;/P&gt;&lt;P&gt;2. What is the optimal number of partitions to use when repartitioning here?&lt;/P&gt;</description>
    <pubDate>Mon, 22 Jan 2024 09:26:32 GMT</pubDate>
    <dc:creator>vishwanath_1</dc:creator>
    <dc:date>2024-01-22T09:26:32Z</dc:date>
    <item>
      <title>Need Suggestion for better caching strategy</title>
      <link>https://community.databricks.com/t5/data-engineering/need-suggestion-for-better-caching-strategy/m-p/58092#M31014</link>
      <description>&lt;P&gt;I have the following steps to perform:&lt;/P&gt;&lt;P&gt;1. Read a CSV file (a very large file, ~100 GB).&lt;/P&gt;&lt;P&gt;2. Add an index using the zipWithIndex function.&lt;/P&gt;&lt;P&gt;3. Repartition the DataFrame.&lt;/P&gt;&lt;P&gt;4. Pass it on to another function.&lt;/P&gt;&lt;P&gt;Can you suggest the best caching strategy to make these steps run faster?&lt;/P&gt;&lt;P&gt;Below is my cluster configuration:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="vishwanath_1_0-1705915220664.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/5909i7998E0612A4B00E6/image-size/medium/is-moderation-mode/true?v=v2&amp;amp;px=400" role="button" title="vishwanath_1_0-1705915220664.png" alt="vishwanath_1_0-1705915220664.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;A few more questions:&lt;/P&gt;&lt;P&gt;1. Would a single worker suffice for this operation?&lt;/P&gt;&lt;P&gt;2. What is the optimal number of partitions to use when repartitioning here?&lt;/P&gt;</description>
      <pubDate>Mon, 22 Jan 2024 09:26:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/need-suggestion-for-better-caching-strategy/m-p/58092#M31014</guid>
      <dc:creator>vishwanath_1</dc:creator>
      <dc:date>2024-01-22T09:26:32Z</dc:date>
    </item>
    <item>
      <title>Re: Need Suggestion for better caching strategy</title>
      <link>https://community.databricks.com/t5/data-engineering/need-suggestion-for-better-caching-strategy/m-p/58117#M31020</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/98019"&gt;@vishwanath_1&lt;/a&gt;, caching only helps when the same data is referenced more than once in your code. In the flow you describe, that is not the case: you read the data from the source once, and there is no branching in your code. Even if you cache the DataFrame here, the cached copy will never be reused.&lt;/P&gt;
&lt;P&gt;Regarding your other questions:&lt;/P&gt;
&lt;P&gt;1. What is the optimal number of partitions: aim to divide the data into chunks of roughly 200-300 MB each. Since you are reading ~100 GB of data, 100 GB / 200 MB = 500 partitions. That is roughly how many partitions to target.&lt;/P&gt;
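&lt;P&gt;The arithmetic above can be sketched in a few lines of plain Python. This is only a rough illustration: the helper name partition_count is hypothetical, and the sketch assumes decimal units (1 GB = 1000 MB), which is what gives exactly 500. The resulting number is what you would pass to DataFrame.repartition.&lt;/P&gt;
&lt;PRE&gt;
```python
def partition_count(data_bytes, target_partition_bytes):
    """Smallest number of partitions so that none exceeds the target size.

    Hypothetical helper for illustration; uses integer ceiling division.
    """
    return (data_bytes + target_partition_bytes - 1) // target_partition_bytes


# Assumption: decimal units, i.e. 1 GB = 1000 MB.
MB = 1000 ** 2
GB = 1000 ** 3

data_size = 100 * GB  # approximate input size from the question

# Upper and lower ends of the suggested 200-300 MB partition size range.
partitions_at_200mb = partition_count(data_size, 200 * MB)
partitions_at_300mb = partition_count(data_size, 300 * MB)

print(partitions_at_200mb)  # 500
print(partitions_at_300mb)  # 334

# The chosen value would then be used as: df.repartition(partitions_at_200mb)
```
&lt;/PRE&gt;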
&lt;P&gt;2.&amp;nbsp;&lt;SPAN&gt;Would a single worker suffice: this depends on three things: (1) data volume, (2) type of operations, and (3) cluster configuration. Your one worker has 256 GB of memory, you are reading 100 GB of data, and the operations in your code do not appear to be very memory-intensive, so I think one worker will be enough from a memory perspective.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;But it can be time-consuming. Your single worker has only 64 cores, so if you repartition the data into 500 partitions, only 64 tasks can run at a time. Completing a single stage will therefore take about 8 waves of tasks (500 / 64 ≈ 7.8), which might not be very efficient.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 22 Jan 2024 12:03:52 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/need-suggestion-for-better-caching-strategy/m-p/58117#M31020</guid>
      <dc:creator>Lakshay</dc:creator>
      <dc:date>2024-01-22T12:03:52Z</dc:date>
    </item>
  </channel>
</rss>

