Data Engineering

I have a job with multiple tasks running asynchronously and I don't think it's leveraging all the nodes on the cluster, based on runtime.

dave_hiltbrand
New Contributor II

I have a job with multiple tasks running asynchronously and I don't think it's leveraging all the nodes on the cluster, based on runtime. When I open the Spark UI for the cluster and check out the Executors page, I don't see any tasks for my worker nodes. How can I monitor the cluster to ensure my tasks are running in parallel and taking advantage of my multi-node cluster?

3 REPLIES

Debayan
Esteemed Contributor III

Hi, could you please try viewing metrics at the node level and see if that's what you are expecting?

https://docs.databricks.com/compute/cluster-metrics.html#view-metrics-at-the-node-level

Please tag @Debayan Mukherjee with your next update so that I will get notified.

Hi Debayan, I did notice that the number of tasks completed per worker node differed when I looked at the Spark UI -> Executors page, so it does appear the whole cluster was used. What I couldn't tell is whether the driver node was sending the tasks out across the workers in parallel or assigning them sequentially.

My workflow looks like this: [screenshot of the workflow, 2023-06-23]

Previously I ran a single notebook and executed the m# notebooks in a loop like this:

for model_number in MODEL_NUMBERS:

    global_parameters['MODEL_NUMBER'] = model_number

    # Build the train/test data for this model.
    print(f"Building {model_number} train/test data...")
    train_test_data = dbutils.notebook.run(
        'build_train_test',
        60 * 60,
        global_parameters)

    train_test_data = json.loads(train_test_data)
    if file_exists(train_test_data['TRAIN_DATA']) and file_exists(train_test_data['TEST_DATA']):
        print(f"{model_number} train/test data complete.")

        # Train the model (no timeout).
        print(f"Training model {model_number}...")
        trained_model = dbutils.notebook.run(
            'training',
            0,
            global_parameters
        )

        if file_exists(trained_model):
            evaluation_metrics = dbutils.notebook.run(
                'evaluation',
                60 * 60,
                global_parameters
            )

            # Verify every metric file was written.
            for metric in evaluation_metrics.split(','):
                if not file_exists(metric):
                    raise FileNotFoundError("Evaluation Metric Not Found: " + metric)

        else:
            raise FileNotFoundError("Trained model not found: " + trained_model)

    else:
        raise FileNotFoundError("Training and Test Data Not Found: " + str(train_test_data))

    print(f"Building final_model {model_number} train/test data...")
    if ENV in ['test', 'prod']:
        dbutils.notebook.run(
            'final_models',
            0,
            global_parameters
        )

The sequential loop job runs in 28 minutes, while the async/parallel job takes 50 minutes.
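One thing worth noting: dbutils.notebook.run is a blocking call, so a plain for-loop like the one above always runs the notebooks one at a time. A common way to fan notebook runs out concurrently is to submit each one from its own thread. Below is a minimal, hedged sketch of that pattern; run_model is a hypothetical stand-in for the per-model pipeline above (in a real Databricks notebook its body would call dbutils.notebook.run), and the model ids are made up for illustration:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

MODEL_NUMBERS = ["m1", "m2", "m3", "m4"]  # hypothetical model ids

def run_model(model_number):
    # Stand-in for the per-model pipeline in the loop above. On Databricks
    # this would call dbutils.notebook.run('build_train_test', ...), then
    # 'training', then 'evaluation'. Each of those calls blocks its thread.
    return f"{model_number} complete"

# Each thread drives one notebook pipeline end to end; Spark itself then
# schedules executor work from whichever notebook jobs are active at once.
results = {}
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(run_model, m): m for m in MODEL_NUMBERS}
    for fut in as_completed(futures):
        results[futures[fut]] = fut.result()
```

Note that concurrent notebooks still share one cluster's executors, so if each model's Spark jobs are big enough to saturate the cluster on their own, running them concurrently adds scheduling contention rather than speedup, which would be consistent with the parallel run being slower here.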

Anonymous
Not applicable

Hi @Dave Hiltbrand

Great to meet you, and thanks for your question!

Let's see if your peers in the community have an answer to your question. Thanks.
