cancel
Showing results forย 
Search instead forย 
Did you mean:ย 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results forย 
Search instead forย 
Did you mean:ย 

I have a job with multiple tasks running asynchronously and I don't think its leveraging all the nodes on the cluster based on runtime.

dave_hiltbrand
New Contributor II

I have a job with multiple tasks running asynchronously and I don't think its leveraging all the nodes on the cluster based on runtime. I open the Spark UI for the cluster and checkout the executors and don't see any tasks for my worker nodes. How can I monitor the cluster to ensure my tasks are running in parallel and taking advantage of my multiple node cluster?

3 REPLIES 3

Debayan
Esteemed Contributor III
Esteemed Contributor III

Hi, Could you please try to view metrics at the node levels and see if thats what you are expecting?

https://docs.databricks.com/compute/cluster-metrics.html#view-metrics-at-the-node-level

Please tag @Debayan Mukherjeeโ€‹ with your next update so that I will get notified.

Hi Debayan, I did notice the number of tasks completed per worker node was different when I looked at the Spark UI -> Executors page. So it does appear the whole cluster was used but what I couldn't tell is if the Driver node was sending out the tasks in parallel across the workers or sequentially assigning. My workflow looks like this:Screenshot 2023-06-23 072337Previously I ran a single notebooks and executed the m# notebooks in a loop like this:

for model_number in MODEL_NUMBERS:
  
    global_parameters['MODEL_NUMBER'] = model_number
    print(f"Building {model_number} train/test data...")
    train_test_data = dbutils.notebook.run(
        'build_train_test',
        60 * 60,
        global_parameters)
 
    train_test_data = json.loads(train_test_data)
    if (file_exists(train_test_data['TRAIN_DATA']) and file_exists(train_test_data['TEST_DATA'])):
        f"{model_number} train/test data complete."
 
        print(f"Training model {model_number}...") 
        trained_model = dbutils.notebook.run(
          'training',
           0,
          global_parameters
        )
 
        if file_exists(trained_model):
            evaluation_metrics = dbutils.notebook.run(
            'evaluation',
            60 * 60,
            global_parameters
            )
 
            for metric in evaluation_metrics.split(','):
                if file_exists(metric):
                    continue
                else:
                    raise FileNotFoundError("Evaluation Metric Not Found: " + metric)
 
        else:
            raise FileNotFoundError("Trained model not found: " + trained_model)
    
    else:
        raise FileNotFoundError("Training and Test Data Not Found: " + train_test_data)
        
    print(f"Building final_model {model_number} train/test data...")
    if ENV in ['test', 'prod']:
        dbutils.notebook.run(
            'final_models',
            0,
            global_parameters
        )

The sequential loop job runs in 28 min while the async/parallel job runs in 50 min.

Anonymous
Not applicable

Hi @Dave Hiltbrandโ€‹ 

Great to meet you, and thanks for your question!

Let's see if your peers in the community have an answer to your question. Thanks.