Why aren't RDDs using all available cores of the executor?

Matt101122
Contributor

I'm extracting data from a custom format, by day of month, on a 32-core executor, using RDDs to distribute the work across the executor's cores. I'm seeing an intermittent issue: some runs use 31 cores as expected, while others use only 2 cores at a time (the other 30 cores sit idle), which causes the notebook to take an excessive amount of time to complete. If I cancel the job and rerun it, it usually uses all the cores as expected. Any thoughts?

The simplified version of my code is something like this:

# one RDD element per day to process
days_rdd = sc.parallelize(days_to_process)
# run the extraction for each day in parallel and collect the results on the driver
cmd_results = days_rdd.map(lambda day: do_some_work(start_date, year, month, day)).collect()
for r in cmd_results:
    print(r)

[Screenshot: Spark UI showing only 2 cores in use, where 31 cores were expected, one per day]

[Screenshot: Spark UI during a healthy run, correctly showing all 31 cores in use]

Accepted Solution

Matt101122
Contributor

I may have figured this out!

I'm explicitly setting the number of slices instead of using the default.

days_rdd = sc.parallelize(days_to_process, len(days_to_process))  # one slice per day
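
For anyone hitting the same behavior, here is a minimal sketch of how to verify the fix, assuming a live SparkContext sc and using do_some_work and days_to_process as stand-ins for the real extraction function and input list; the partition count is what caps how many tasks (and therefore cores) can run at once:

# Check how many partitions the default gives you; if this prints a small
# number, only that many tasks can run concurrently.
days_rdd_default = sc.parallelize(days_to_process)
print(days_rdd_default.getNumPartitions())

# Forcing one slice per day guarantees one task per day, so each day can
# land on its own core.
days_rdd = sc.parallelize(days_to_process, numSlices=len(days_to_process))
print(days_rdd.getNumPartitions())  # == len(days_to_process)

When numSlices isn't given, parallelize falls back on spark.default.parallelism, so pinning it explicitly removes any dependence on whatever value the context happens to report at run time.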

