I'm extracting data from a custom format, one task per day of the month, on a 32-core executor. I'm using RDDs to distribute the work across the executor's cores. I'm seeing an intermittent issue: on some runs, 31 cores are used as expected, but on other runs only 2 cores are active at a time (the other 30 sit idle), which makes the notebook take an excessive amount of time to complete. If I cancel the job and rerun it, it usually uses all the cores as expected. Any thoughts?
A simplified version of my code looks something like this:
days_rdd = sc.parallelize(days_to_process)
cmd_results = days_rdd.map(lambda day: do_some_work(start_date,year,month,day)).collect()
for r in cmd_results:
    print(r)
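For context, here is a minimal self-contained sketch of the pattern I'm running (the date values and the body of do_some_work are placeholders, not my real extraction logic; the explicit numSlices argument is only there to illustrate asking for one partition per day):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Stand-in for the real work; the actual job reads a custom format for the given day.
def do_some_work(start_date, year, month, day):
    return f"{year}-{month:02d}-{day:02d}: processed"

start_date = "2024-01-01"            # placeholder
year, month = 2024, 1                # placeholder
days_to_process = list(range(1, 32)) # 31 days

# numSlices is passed explicitly here so each day lands in its own partition;
# without it, parallelize() splits the list into the default number of partitions.
days_rdd = sc.parallelize(days_to_process, numSlices=len(days_to_process))
cmd_results = days_rdd.map(lambda day: do_some_work(start_date, year, month, day)).collect()

for r in cmd_results:
    print(r)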
View of the Spark UI when only 2 cores are being used (I expect to see 31 cores in use, one for each day):
When it's working correctly, the view shows all 31 cores in use: