<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic why aren't rdds using all available cores of executor? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/why-aren-t-rdds-using-all-available-cores-of-executor/m-p/27925#M19763</link>
    <description>&lt;P&gt;I'm extracting data from a custom format by day of month using a 32-core executor, and I'm using RDDs to distribute the work across the executor's cores. I'm seeing an intermittent issue: on some runs 31 cores are used as expected, but on others only 2 cores are used at a time (the other 30 cores sit idle), which causes the notebook to take an excessive amount of time to complete. If I cancel the job and rerun it, it usually uses all the cores as expected. Any thoughts?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;A simplified version of my code looks like this:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;days_rdd = sc.parallelize(days_to_process)
cmd_results = days_rdd.map(lambda day: do_some_work(start_date, year, month, day)).collect()
for r in cmd_results:
  print(r)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;View of the Spark UI with only 2 cores being used (I expect to see 31 cores in use, one for each day):&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="image"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1357i52F9F5D7744ED5F0/image-size/large?v=v2&amp;amp;px=999" role="button" title="image" alt="image" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;When working properly, the view shows all 31 cores being used:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="image"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1367i79EDC7760112BD35/image-size/large?v=v2&amp;amp;px=999" role="button" title="image" alt="image" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
    <pubDate>Tue, 11 Oct 2022 18:01:20 GMT</pubDate>
    <dc:creator>Matt101122</dc:creator>
    <dc:date>2022-10-11T18:01:20Z</dc:date>
    <item>
      <title>why aren't rdds using all available cores of executor?</title>
      <link>https://community.databricks.com/t5/data-engineering/why-aren-t-rdds-using-all-available-cores-of-executor/m-p/27925#M19763</link>
      <description>&lt;P&gt;I'm extracting data from a custom format by day of month using a 32-core executor, and I'm using RDDs to distribute the work across the executor's cores. I'm seeing an intermittent issue: on some runs 31 cores are used as expected, but on others only 2 cores are used at a time (the other 30 cores sit idle), which causes the notebook to take an excessive amount of time to complete. If I cancel the job and rerun it, it usually uses all the cores as expected. Any thoughts?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;A simplified version of my code looks like this:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;days_rdd = sc.parallelize(days_to_process)
cmd_results = days_rdd.map(lambda day: do_some_work(start_date, year, month, day)).collect()
for r in cmd_results:
  print(r)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;View of the Spark UI with only 2 cores being used (I expect to see 31 cores in use, one for each day):&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="image"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1357i52F9F5D7744ED5F0/image-size/large?v=v2&amp;amp;px=999" role="button" title="image" alt="image" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;When working properly, the view shows all 31 cores being used:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="image"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1367i79EDC7760112BD35/image-size/large?v=v2&amp;amp;px=999" role="button" title="image" alt="image" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 11 Oct 2022 18:01:20 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/why-aren-t-rdds-using-all-available-cores-of-executor/m-p/27925#M19763</guid>
      <dc:creator>Matt101122</dc:creator>
      <dc:date>2022-10-11T18:01:20Z</dc:date>
    </item>
    <item>
      <title>Re: why aren't rdds using all available cores of executor?</title>
      <link>https://community.databricks.com/t5/data-engineering/why-aren-t-rdds-using-all-available-cores-of-executor/m-p/27926#M19764</link>
      <description>&lt;P&gt;I may have figured this out!&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I'm now explicitly setting the number of slices instead of relying on the default. Presumably the default number of partitions was sometimes only 2 on those bad runs, so Spark could only schedule 2 tasks at a time; passing numSlices equal to the number of days guarantees one partition, and therefore one task, per day.&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;days_rdd = sc.parallelize(days_to_process, len(days_to_process))&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 13 Oct 2022 13:59:15 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/why-aren-t-rdds-using-all-available-cores-of-executor/m-p/27926#M19764</guid>
      <dc:creator>Matt101122</dc:creator>
      <dc:date>2022-10-13T13:59:15Z</dc:date>
    </item>
  </channel>
</rss>

