<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: I have a job with multiple tasks running asynchronously and I don't think it's leveraging all the nodes on the cluster based on runtime. in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/i-have-a-job-with-multiple-tasks-running-asynchronously-and-i/m-p/2694#M22</link>
    <description>&lt;P&gt;Hi Debayan, I did notice that the number of tasks completed per worker node differed when I looked at the Spark UI -&amp;gt; Executors page. So it does appear the whole cluster was used, but what I couldn't tell is whether the driver node was sending the tasks out to the workers in parallel or assigning them sequentially. My workflow looks like this:&lt;span class="lia-inline-image-display-wrapper" image-alt="Screenshot 2023-06-23 072337"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/58i01BFF5582FA59E62/image-size/large?v=v2&amp;amp;px=999" role="button" title="Screenshot 2023-06-23 072337" alt="Screenshot 2023-06-23 072337" /&gt;&lt;/span&gt;Previously I ran a single notebook that executed the model notebooks in a loop like this:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;for model_number in MODEL_NUMBERS:

    global_parameters['MODEL_NUMBER'] = model_number
    print(f"Building {model_number} train/test data...")
    train_test_data = dbutils.notebook.run(
        'build_train_test',
        60 * 60,  # timeout in seconds
        global_parameters)

    train_test_data = json.loads(train_test_data)
    if file_exists(train_test_data['TRAIN_DATA']) and file_exists(train_test_data['TEST_DATA']):
        print(f"{model_number} train/test data complete.")

        print(f"Training model {model_number}...")
        trained_model = dbutils.notebook.run(
            'training',
            0,  # 0 = no timeout
            global_parameters
        )

        if file_exists(trained_model):
            evaluation_metrics = dbutils.notebook.run(
                'evaluation',
                60 * 60,
                global_parameters
            )

            for metric in evaluation_metrics.split(','):
                if not file_exists(metric):
                    raise FileNotFoundError("Evaluation metric not found: " + metric)

        else:
            raise FileNotFoundError("Trained model not found: " + trained_model)

    else:
        raise FileNotFoundError("Training and test data not found: " + str(train_test_data))

    print(f"Building final_model {model_number}...")
    if ENV in ['test', 'prod']:
        dbutils.notebook.run(
            'final_models',
            0,
            global_parameters
        )&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;The sequential loop job runs in 28 min, while the async/parallel job runs in 50 min.&lt;/P&gt;</description>
    <pubDate>Fri, 23 Jun 2023 11:26:29 GMT</pubDate>
    <dc:creator>dave_hiltbrand</dc:creator>
    <dc:date>2023-06-23T11:26:29Z</dc:date>
    <item>
      <title>I have a job with multiple tasks running asynchronously and I don't think it's leveraging all the nodes on the cluster based on runtime.</title>
      <link>https://community.databricks.com/t5/data-engineering/i-have-a-job-with-multiple-tasks-running-asynchronously-and-i/m-p/2691#M19</link>
      <description>&lt;P&gt;I have a job with multiple tasks running asynchronously, and based on the runtime I don't think it's leveraging all the nodes on the cluster. I opened the Spark UI for the cluster, checked the Executors page, and don't see any tasks for my worker nodes. How can I monitor the cluster to ensure my tasks are running in parallel and taking advantage of my multi-node cluster?&lt;/P&gt;</description>
      <pubDate>Fri, 23 Jun 2023 02:47:26 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/i-have-a-job-with-multiple-tasks-running-asynchronously-and-i/m-p/2691#M19</guid>
      <dc:creator>dave_hiltbrand</dc:creator>
      <dc:date>2023-06-23T02:47:26Z</dc:date>
    </item>
    <item>
      <title>Re: I have a job with multiple tasks running asynchronously and I don't think it's leveraging all the nodes on the cluster based on runtime.</title>
      <link>https://community.databricks.com/t5/data-engineering/i-have-a-job-with-multiple-tasks-running-asynchronously-and-i/m-p/2693#M21</link>
      <description>&lt;P&gt;Hi @Dave Hiltbrand,&lt;/P&gt;&lt;P&gt;Great to meet you, and thanks for your question!&lt;/P&gt;&lt;P&gt;Let's see if your peers in the community have an answer to your question. Thanks.&lt;/P&gt;</description>
      <pubDate>Fri, 23 Jun 2023 07:18:56 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/i-have-a-job-with-multiple-tasks-running-asynchronously-and-i/m-p/2693#M21</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2023-06-23T07:18:56Z</dc:date>
    </item>
    <item>
      <title>Re: I have a job with multiple tasks running asynchronously and I don't think it's leveraging all the nodes on the cluster based on runtime.</title>
      <link>https://community.databricks.com/t5/data-engineering/i-have-a-job-with-multiple-tasks-running-asynchronously-and-i/m-p/2694#M22</link>
      <description>&lt;P&gt;Hi Debayan, I did notice that the number of tasks completed per worker node differed when I looked at the Spark UI -&amp;gt; Executors page. So it does appear the whole cluster was used, but what I couldn't tell is whether the driver node was sending the tasks out to the workers in parallel or assigning them sequentially. My workflow looks like this:&lt;span class="lia-inline-image-display-wrapper" image-alt="Screenshot 2023-06-23 072337"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/58i01BFF5582FA59E62/image-size/large?v=v2&amp;amp;px=999" role="button" title="Screenshot 2023-06-23 072337" alt="Screenshot 2023-06-23 072337" /&gt;&lt;/span&gt;Previously I ran a single notebook that executed the model notebooks in a loop like this:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;for model_number in MODEL_NUMBERS:

    global_parameters['MODEL_NUMBER'] = model_number
    print(f"Building {model_number} train/test data...")
    train_test_data = dbutils.notebook.run(
        'build_train_test',
        60 * 60,  # timeout in seconds
        global_parameters)

    train_test_data = json.loads(train_test_data)
    if file_exists(train_test_data['TRAIN_DATA']) and file_exists(train_test_data['TEST_DATA']):
        print(f"{model_number} train/test data complete.")

        print(f"Training model {model_number}...")
        trained_model = dbutils.notebook.run(
            'training',
            0,  # 0 = no timeout
            global_parameters
        )

        if file_exists(trained_model):
            evaluation_metrics = dbutils.notebook.run(
                'evaluation',
                60 * 60,
                global_parameters
            )

            for metric in evaluation_metrics.split(','):
                if not file_exists(metric):
                    raise FileNotFoundError("Evaluation metric not found: " + metric)

        else:
            raise FileNotFoundError("Trained model not found: " + trained_model)

    else:
        raise FileNotFoundError("Training and test data not found: " + str(train_test_data))

    print(f"Building final_model {model_number}...")
    if ENV in ['test', 'prod']:
        dbutils.notebook.run(
            'final_models',
            0,
            global_parameters
        )&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;The sequential loop job runs in 28 min, while the async/parallel job runs in 50 min.&lt;/P&gt;</description>
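      The reply above contrasts a sequential driver loop with a multi-task job. A common alternative is to keep a single notebook but fan the per-model chains out from the driver with a thread pool, since dbutils.notebook.run blocks the calling thread. The sketch below is a minimal, hypothetical illustration of that pattern: run_model_pipeline is a stand-in for the build/train/evaluate chain (in a real workspace it would wrap the dbutils.notebook.run calls shown above), and MODEL_NUMBERS / global_parameters are placeholder values.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_model_pipeline(model_number, params):
    # Hypothetical stand-in for the per-model notebook chain; in a real
    # Databricks job this would call dbutils.notebook.run('build_train_test',
    # 60 * 60, params), then the training and evaluation notebooks.
    # Stubbed out here so the orchestration pattern itself is runnable.
    return f"model {model_number} done"

MODEL_NUMBERS = [1, 2, 3, 4]   # placeholder model ids
global_parameters = {}         # placeholder shared job parameters

results = {}
# Each thread drives one model's pipeline end to end; the driver stays busy
# orchestrating while Spark schedules the actual cluster work.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {
        pool.submit(run_model_pipeline, m, dict(global_parameters, MODEL_NUMBER=m)): m
        for m in MODEL_NUMBERS
    }
    for future in as_completed(futures):
        results[futures[future]] = future.result()

print(results)
```

      Whether this beats the multi-task job depends on the cluster: if the parallel chains contend for the same executors, wall-clock time can get worse, which may be one explanation for the 50-minute parallel run versus the 28-minute sequential one.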
      <pubDate>Fri, 23 Jun 2023 11:26:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/i-have-a-job-with-multiple-tasks-running-asynchronously-and-i/m-p/2694#M22</guid>
      <dc:creator>dave_hiltbrand</dc:creator>
      <dc:date>2023-06-23T11:26:29Z</dc:date>
    </item>
    <item>
      <title>Re: I have a job with multiple tasks running asynchronously and I don't think it's leveraging all the nodes on the cluster based on runtime.</title>
      <link>https://community.databricks.com/t5/data-engineering/i-have-a-job-with-multiple-tasks-running-asynchronously-and-i/m-p/2692#M20</link>
      <description>&lt;P&gt;Hi, could you please try to view metrics at the node level and see if that's what you are expecting?&lt;/P&gt;&lt;P&gt;&lt;A href="https://docs.databricks.com/compute/cluster-metrics.html#view-metrics-at-the-node-level" alt="https://docs.databricks.com/compute/cluster-metrics.html#view-metrics-at-the-node-level" target="_blank"&gt;https://docs.databricks.com/compute/cluster-metrics.html#view-metrics-at-the-node-level&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Please tag @Debayan Mukherjee with your next update so that I will get notified.&lt;/P&gt;</description>
      <pubDate>Fri, 23 Jun 2023 07:18:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/i-have-a-job-with-multiple-tasks-running-asynchronously-and-i/m-p/2692#M20</guid>
      <dc:creator>Debayan</dc:creator>
      <dc:date>2023-06-23T07:18:12Z</dc:date>
    </item>
  </channel>
</rss>

