I have a job with multiple tasks running asynchronously and I don't think it's leveraging all the nodes on the cluster based on runtime.
06-22-2023 07:47 PM
I have a job with multiple tasks running asynchronously and I don't think it's leveraging all the nodes on the cluster based on runtime. I open the Spark UI for the cluster and check out the executors, but I don't see any tasks for my worker nodes. How can I monitor the cluster to ensure my tasks are running in parallel and taking advantage of my multiple-node cluster?
- Labels:
  - Job
  - Multiple Tasks
  - Spark UI
06-23-2023 12:18 AM
Hi, could you please try viewing the metrics at the node level and see if that's what you are expecting?
https://docs.databricks.com/compute/cluster-metrics.html#view-metrics-at-the-node-level
Please tag @Debayan Mukherjee with your next update so that I will get notified.
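If you'd also like to check from the notebook itself, here is a minimal sketch that polls PySpark's StatusTracker for active tasks per running stage. It assumes the Databricks-provided SparkContext `sc`; the 10-second interval and the number of polling iterations are arbitrary choices for illustration:

import time

# Poll the status tracker while the workload runs. StatusTracker and
# SparkStageInfo are part of the public PySpark API; getStageInfo can
# return None if the stage has already finished.
tracker = sc.statusTracker()
for _ in range(30):
    for stage_id in tracker.getActiveStageIds():
        info = tracker.getStageInfo(stage_id)
        if info is not None:
            print(f"stage {stage_id} ({info.name}): "
                  f"{info.numActiveTasks} active / {info.numTasks} total tasks")
    time.sleep(10)

If numActiveTasks stays well below the cluster's total core count while the job runs, the work is not being spread across the workers.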
06-23-2023 04:26 AM
Hi Debayan, I did notice the number of tasks completed per worker node was different when I looked at the Spark UI -> Executors page. So it does appear the whole cluster was used, but what I couldn't tell is whether the driver node was sending tasks out across the workers in parallel or assigning them sequentially. My workflow looks like this. Previously, I ran a single notebook and executed the notebooks for each model number in a loop like this:
import json

for model_number in MODEL_NUMBERS:
    global_parameters['MODEL_NUMBER'] = model_number

    # Build the train/test data for this model (1-hour timeout).
    print(f"Building {model_number} train/test data...")
    train_test_data = dbutils.notebook.run(
        'build_train_test',
        60 * 60,
        global_parameters)
    train_test_data = json.loads(train_test_data)

    if file_exists(train_test_data['TRAIN_DATA']) and file_exists(train_test_data['TEST_DATA']):
        print(f"{model_number} train/test data complete.")

        # Train the model (no timeout).
        print(f"Training model {model_number}...")
        trained_model = dbutils.notebook.run(
            'training',
            0,
            global_parameters
        )
        if file_exists(trained_model):
            # Evaluate the trained model (1-hour timeout).
            evaluation_metrics = dbutils.notebook.run(
                'evaluation',
                60 * 60,
                global_parameters
            )
            for metric in evaluation_metrics.split(','):
                if not file_exists(metric):
                    raise FileNotFoundError("Evaluation Metric Not Found: " + metric)
        else:
            raise FileNotFoundError("Trained model not found: " + trained_model)
    else:
        raise FileNotFoundError("Training and Test Data Not Found: " + str(train_test_data))

    print(f"Building final_model {model_number} train/test data...")
    if ENV in ['test', 'prod']:
        dbutils.notebook.run(
            'final_models',
            0,
            global_parameters
        )
The sequential loop job runs in 28 min while the async/parallel job runs in 50 min.
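For reference, a common way to fan notebook runs out in parallel from the driver is a thread pool around dbutils.notebook.run, since each call blocks only its own calling thread. This is a minimal sketch, assuming the same MODEL_NUMBERS and global_parameters as above; the run_pipeline helper and the worker count are illustrative, not the actual job definition:

from concurrent.futures import ThreadPoolExecutor
import copy

def run_pipeline(model_number):
    # Each thread gets its own copy of the parameters so the
    # MODEL_NUMBER values don't overwrite each other.
    params = copy.deepcopy(global_parameters)
    params['MODEL_NUMBER'] = model_number
    return dbutils.notebook.run('build_train_test', 60 * 60, params)

# dbutils.notebook.run blocks the calling thread, so a thread pool
# lets several notebook runs execute concurrently. Each run still
# schedules its Spark jobs on the same shared cluster.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_pipeline, MODEL_NUMBERS))

Note that concurrent notebook runs still compete for the same executors, so a parallel layout can end up slower than a sequential loop if each run's stages already keep the cluster busy on their own.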

06-23-2023 12:18 AM
Hi @Dave Hiltbrand
Great to meet you, and thanks for your question!
Let's see if your peers in the community have an answer to your question. Thanks.

