cluster nodes unavailable scenarios

Nino
Contributor
Concerning job cluster configuration, I'm trying to figure out what happens if AWS node type availability is smaller than the minimum number of workers specified in the configuration JSON (either availability < num_workers or, for autoscaling, availability < min_workers).
 
Seeking insights into both scenarios:
  1. Low availability at cluster start
  2. Availability drop while computation is already in progress

Will the cluster start/continue computation? Wait? Fail?
Are there configurations to tweak the related cluster behavior?
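
For reference, a minimal sketch of the two cluster JSON shapes I mean (node type and worker counts are just placeholder examples):

```json
{
  "node_type_id": "i3.xlarge",
  "num_workers": 8
}
```

or, with autoscaling:

```json
{
  "node_type_id": "i3.xlarge",
  "autoscale": { "min_workers": 2, "max_workers": 8 }
}
```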
 
thanks!

Kaniz
Community Manager

Hi @Nino,

Low availability at cluster start:
 - If AWS node type availability is smaller than the minimum number of workers specified in the configuration JSON, the cluster waits for the required number of workers to become available.
 - If availability doesn't meet the minimum requirement within the timeout period, the cluster fails to start.
 - Ensure AWS node type availability is sufficient to meet the minimum number of workers specified in the configuration JSON.

Availability drop while computation is in progress:
 - If the availability of AWS node types drops below the number of workers currently running in the cluster, computation continues with the available workers.
 - The cluster may fail if availability drops below the minimum number of workers specified in the configuration JSON.
 - Monitor the availability of AWS node types and ensure it stays above the minimum number of workers specified in the configuration JSON during computation.

Configurations to tweak the related cluster behavior:
 - Increase the timeout period the cluster waits for the required number of workers to become available by modifying the spark.databricks.clusterUsageTimeout configuration parameter.
 - Enable autoscaling so the cluster can handle an availability drop while computation is in progress; this is set via the `autoscale` block (`min_workers`/`max_workers`) in the cluster JSON rather than a single Spark configuration parameter.
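
For concreteness, a minimal sketch of where these would live in the cluster JSON (the worker counts and timeout value are placeholders, and the timeout key is the one named above, so verify it against your platform version):

```json
{
  "spark_conf": {
    "spark.databricks.clusterUsageTimeout": "3600"
  },
  "autoscale": {
    "min_workers": 4,
    "max_workers": 10
  }
}
```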

Nino
Contributor

Thanks, @Kaniz, useful info!

My specific scenario is running a notebook task on a Job Cluster, and I've noticed that I get the best overall notebook run time without autoscaling, setting the cluster configuration with a fixed `num_workers` (specifically, it's a single notebook where a heavy ETL operation is followed by a lightweight cmd cell, then something heavy again, so the cluster autoscales up and down a lot).
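
For context, the job definition is shaped roughly like this (a minimal sketch; the names, paths, node type, and counts are illustrative):

```json
{
  "name": "etl-notebook-job",
  "job_clusters": [
    {
      "job_cluster_key": "fixed_size",
      "new_cluster": {
        "spark_version": "13.3.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 8
      }
    }
  ],
  "tasks": [
    {
      "task_key": "run_notebook",
      "job_cluster_key": "fixed_size",
      "notebook_task": { "notebook_path": "/Workspace/etl/main" }
    }
  ]
}
```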

So, by your explanation, the num_workers approach puts me at risk in the case of low instance availability. This can be mitigated by Autoscaling, which in turn leads to increased run time. 

Is there a way to configure the Job Cluster so that it "aspires" for an ideal size, but doesn't fail if this ideal isn't reached?

This would be similar to autoscaling, except that the cluster would not downsize voluntarily (it would downsize only if lowered availability forces it to, and even then it wouldn't immediately fail). So if configured to "aspire" to 100 nodes, it would wait x minutes and then start if anything more than 50 nodes is available. Say availability grows 30 minutes later: it would upscale, "aspiring" to those 100...
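
For illustration, the closest existing approximation I can see is an autoscale range like the one below, but that still downsizes voluntarily, which is exactly what I'd like to avoid:

```json
{
  "autoscale": {
    "min_workers": 50,
    "max_workers": 100
  }
}
```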

Can something like this be achieved?

Thanks!   
