I'm extracting data from a custom format, one task per day of the month, on a 32-core executor. I'm using RDDs to distribute the work across the executor's cores. I'm seeing an intermittent issue: on some runs, 31 cores are used as expected, but on other runs only 2 cores are active at a time (the other 30 sit idle), which makes the notebook take an excessive amount of time to complete. If I cancel the job and rerun it, it usually uses all the cores as expected. Any thoughts?
A simplified version of my code looks something like this:
days_rdd = sc.parallelize(days_to_process)
cmd_results = days_rdd.map(lambda day: do_some_work(start_date,year,month,day)).collect()
for r in cmd_results:
    print(r)
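For context, here is a minimal self-contained sketch of the pattern I'm running (the date values and the body of do_some_work are placeholders, not my real extraction logic; the explicit numSlices argument is only there to illustrate asking for one partition per day):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Stand-in for the real work; the actual job reads a custom format for the given day.
def do_some_work(start_date, year, month, day):
    return f"{year}-{month:02d}-{day:02d}: processed"

start_date = "2024-01-01"            # placeholder
year, month = 2024, 1                # placeholder
days_to_process = list(range(1, 32)) # 31 days

# numSlices is passed explicitly here so each day lands in its own partition;
# without it, parallelize() splits the list into the default number of partitions.
days_rdd = sc.parallelize(days_to_process, numSlices=len(days_to_process))
cmd_results = days_rdd.map(lambda day: do_some_work(start_date, year, month, day)).collect()

for r in cmd_results:
    print(r)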
View of the Spark UI when only 2 cores are being used (I expect to see 31 cores in use, one for each day):
When it's working correctly, the view shows all 31 cores in use: