Hi Debayan, I did notice that the number of tasks completed per worker node was different when I looked at the Spark UI -> Executors page, so it does appear the whole cluster was used. What I couldn't tell is whether the Driver node was handing tasks out to the workers in parallel or assigning them sequentially. My workflow looks like this: previously I had a single notebook that ran the child notebooks for each model number in a loop, like this:
import json

# file_exists, MODEL_NUMBERS, global_parameters, and ENV are defined earlier in the notebook.
for model_number in MODEL_NUMBERS:
    global_parameters['MODEL_NUMBER'] = model_number

    print(f"Building {model_number} train/test data...")
    train_test_data = dbutils.notebook.run(
        'build_train_test',
        60 * 60,
        global_parameters)
    train_test_data = json.loads(train_test_data)

    if file_exists(train_test_data['TRAIN_DATA']) and file_exists(train_test_data['TEST_DATA']):
        print(f"{model_number} train/test data complete.")

        print(f"Training model {model_number}...")
        trained_model = dbutils.notebook.run(
            'training',
            0,
            global_parameters
        )
        if file_exists(trained_model):
            evaluation_metrics = dbutils.notebook.run(
                'evaluation',
                60 * 60,
                global_parameters
            )
            for metric in evaluation_metrics.split(','):
                if not file_exists(metric):
                    raise FileNotFoundError("Evaluation Metric Not Found: " + metric)
        else:
            raise FileNotFoundError("Trained model not found: " + trained_model)
    else:
        raise FileNotFoundError("Training and Test Data Not Found: " + str(train_test_data))

    print(f"Building final_model {model_number} train/test data...")
    if ENV in ['test', 'prod']:
        dbutils.notebook.run(
            'final_models',
            0,
            global_parameters
        )
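For context, the usual way to run dbutils.notebook.run calls concurrently is to fan them out from the driver with a thread pool; here is a minimal sketch of that pattern (illustrative only, not my exact implementation; run_one_model is a hypothetical stand-in for the per-model loop body above):

from concurrent.futures import ThreadPoolExecutor

def run_one_model(model_number):
    # Hypothetical stand-in for the loop body above: copy the parameters so
    # concurrent runs don't mutate one shared dict, then chain
    # build_train_test -> training -> evaluation -> final_models.
    params = dict(global_parameters, MODEL_NUMBER=model_number)
    ...
    return model_number

# Each dbutils.notebook.run call blocks its own thread while the child notebook
# job runs, so a thread pool lets several child jobs be in flight at once even
# though all of the submissions come from the driver.
with ThreadPoolExecutor(max_workers=len(MODEL_NUMBERS)) as pool:
    completed = list(pool.map(run_one_model, MODEL_NUMBERS))

Note that pool.map only submits and waits for the child notebook jobs concurrently; it doesn't change how Spark schedules their tasks across the workers.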
The sequential loop job runs in 28 min while the async/parallel job runs in 50 min.