<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic why aren't rdds using all available cores of executor? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/why-aren-t-rdds-using-all-available-cores-of-executor/m-p/27925#M19763</link>
    <description>&lt;P&gt;I'm extracting data from a custom format by day of month using a 32-core executor, and I'm using RDDs to distribute the work across the executor's cores. I'm seeing an intermittent issue: on some runs 31 cores are used as expected, but on others only 2 cores are used at a time (the other 30 cores sit idle), which causes the notebook to take an excessive amount of time to complete. If I cancel the job and rerun it, it usually uses all the cores as expected. Any thoughts?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;A simplified version of my code looks like this:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;days_rdd = sc.parallelize(days_to_process)
cmd_results = days_rdd.map(lambda day: do_some_work(start_date, year, month, day)).collect()
for r in cmd_results:
  print(r)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;View of the Spark UI with only 2 cores being used (I expect to see 31 cores in use, one for each day):&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="image"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1357i52F9F5D7744ED5F0/image-size/large?v=v2&amp;amp;px=999" role="button" title="image" alt="image" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;When working properly, the view shows all 31 cores being used:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="image"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1367i79EDC7760112BD35/image-size/large?v=v2&amp;amp;px=999" role="button" title="image" alt="image" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
    <pubDate>Tue, 11 Oct 2022 18:01:20 GMT</pubDate>
    <dc:creator>Matt101122</dc:creator>
    <dc:date>2022-10-11T18:01:20Z</dc:date>
    <item>
      <title>why aren't rdds using all available cores of executor?</title>
      <link>https://community.databricks.com/t5/data-engineering/why-aren-t-rdds-using-all-available-cores-of-executor/m-p/27925#M19763</link>
      <description>&lt;P&gt;I'm extracting data from a custom format by day of month using a 32-core executor, and I'm using RDDs to distribute the work across the executor's cores. I'm seeing an intermittent issue: on some runs 31 cores are used as expected, but on others only 2 cores are used at a time (the other 30 cores sit idle), which causes the notebook to take an excessive amount of time to complete. If I cancel the job and rerun it, it usually uses all the cores as expected. Any thoughts?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;A simplified version of my code looks like this:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;days_rdd = sc.parallelize(days_to_process)
cmd_results = days_rdd.map(lambda day: do_some_work(start_date, year, month, day)).collect()
for r in cmd_results:
  print(r)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;View of the Spark UI with only 2 cores being used (I expect to see 31 cores in use, one for each day):&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="image"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1357i52F9F5D7744ED5F0/image-size/large?v=v2&amp;amp;px=999" role="button" title="image" alt="image" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;When working properly, the view shows all 31 cores being used:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="image"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1367i79EDC7760112BD35/image-size/large?v=v2&amp;amp;px=999" role="button" title="image" alt="image" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 11 Oct 2022 18:01:20 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/why-aren-t-rdds-using-all-available-cores-of-executor/m-p/27925#M19763</guid>
      <dc:creator>Matt101122</dc:creator>
      <dc:date>2022-10-11T18:01:20Z</dc:date>
    </item>
    <item>
      <title>Re: why aren't rdds using all available cores of executor?</title>
      <link>https://community.databricks.com/t5/data-engineering/why-aren-t-rdds-using-all-available-cores-of-executor/m-p/27926#M19764</link>
      <description>&lt;P&gt;I may have figured this out!&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I'm now explicitly setting the number of slices instead of relying on the default. Presumably the default number of partitions was sometimes only 2 on those bad runs, so Spark could only schedule 2 tasks at a time; passing numSlices equal to the number of days guarantees one partition, and therefore one task, per day.&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;days_rdd = sc.parallelize(days_to_process, len(days_to_process))&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 13 Oct 2022 13:59:15 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/why-aren-t-rdds-using-all-available-cores-of-executor/m-p/27926#M19764</guid>
      <dc:creator>Matt101122</dc:creator>
      <dc:date>2022-10-13T13:59:15Z</dc:date>
    </item>
  </channel>
</rss>

