Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

cluster nodes unavailable scenarios

Nino
Contributor
Concerning job cluster configuration, I'm trying to figure out what happens if AWS node type availability is smaller than the minimum number of workers specified in the configuration JSON (either availability < num_workers or, for autoscaling, availability < min_workers).
 
Seeking insights into both scenarios:
  1. Low availability at cluster start
  2. Availability drop while computation is already in progress

Will the cluster start/continue computation? Wait? Fail?
Are there configurations to tweak related cluster behavior?
 
thanks!
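
For readers following along, the two configurations the question contrasts can be sketched as below. This is a minimal illustration, not a recommended setup: the field names `num_workers` and `autoscale` (with `min_workers` and `max_workers`) follow the cluster JSON mentioned in the question, while the runtime version, node type, worker counts, and the helper functions are placeholders assumed for the example.

```python
# Hypothetical job-cluster specs; the runtime version, node type, and
# worker counts are placeholder values chosen only for illustration.
fixed_size_cluster = {
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 100,
}

autoscaling_cluster = {
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "autoscale": {"min_workers": 50, "max_workers": 100},
}


def minimum_workers(cluster_spec: dict) -> int:
    """Return the smallest worker count the spec allows the cluster to run with."""
    if "autoscale" in cluster_spec:
        return cluster_spec["autoscale"]["min_workers"]
    return cluster_spec["num_workers"]


def has_shortfall(cluster_spec: dict, available_nodes: int) -> bool:
    """True when fewer nodes are available than the spec's minimum (the case asked about)."""
    return available_nodes < minimum_workers(cluster_spec)
```

With 80 nodes available, `has_shortfall(fixed_size_cluster, 80)` is true while `has_shortfall(autoscaling_cluster, 80)` is false, which is the difference between the two cases in the question.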
2 REPLIES

Kaniz_Fatma
Community Manager

Hi @Nino , 

• Low availability at cluster start:
 - Cluster may fail to start if AWS node type availability is smaller than the minimum number of workers specified in configuration JSON.
 - Cluster waits for the required number of workers to become available. The cluster may fail to start if the availability doesn't meet the minimum requirement within the timeout period.
 - Ensure AWS node type availability is sufficient to meet a minimum number of workers specified in configuration JSON.

• Availability drop while computation is in progress:
 - If the availability of AWS node types drops below the number of workers currently running in the cluster, computation continues with available workers.
 - The cluster may fail if availability drops below the minimum number of workers specified in the configuration JSON.
 - Monitor availability of AWS node types and ensure it remains above the minimum number of workers specified in configuration JSON during computation.

• Configurations to tweak related cluster behaviour:
- Increase the timeout period for the cluster to wait for the required number of workers to become available by modifying spark.databricks.clusterUsageTimeout configuration parameter.
- Enable autoscaling for the cluster to handle availability drop while computation is in progress by modifying spark.databricks.autoscale.enabled configuration parameter.
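
The behaviour described in this reply can be summarised as a small decision model. This is only a sketch of what the reply states, not verified against Databricks documentation; the function names, the `Outcome` values, and the timeout argument are assumptions made for the example.

```python
from enum import Enum


class Outcome(Enum):
    START = "start"
    WAIT = "wait"
    FAIL = "fail"
    CONTINUE = "continue"
    MAY_FAIL = "may fail"


def startup_outcome(available_nodes: int, min_workers: int,
                    waited_minutes: float, timeout_minutes: float) -> Outcome:
    """Model of the 'low availability at cluster start' case from the reply."""
    if available_nodes >= min_workers:
        return Outcome.START
    # Not enough nodes yet: keep waiting until the (assumed) timeout expires.
    if waited_minutes < timeout_minutes:
        return Outcome.WAIT
    return Outcome.FAIL


def running_outcome(available_nodes: int, min_workers: int) -> Outcome:
    """Model of the 'availability drop while computation is in progress' case."""
    # Per the reply, work continues with the available workers unless
    # availability falls below the configured minimum.
    if available_nodes >= min_workers:
        return Outcome.CONTINUE
    return Outcome.MAY_FAIL
```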

Nino
Contributor

thanks, @Kaniz_Fatma , useful info!

My specific scenario is running a notebook task on a Job Cluster. I've noticed that I get the best overall notebook run time without Autoscaling, by setting a fixed `num_workers` in the cluster configuration. (It's a single notebook where a heavy ETL operation is followed by a lightweight cmd cell, then something heavy again, so with Autoscaling the cluster scales up and down a lot.)

So, by your explanation, the num_workers approach puts me at risk in the case of low instance availability. This can be mitigated by Autoscaling, which in turn leads to increased run time. 

Is there a way to configure the Job Cluster so that it "aspires" for an ideal size, but doesn't fail if this ideal isn't reached?

This would be similar to Autoscaling, except that the cluster would not downsize voluntarily (it would downsize only if lowered availability forces it to, and even then wouldn't immediately fail). So if configured to "aspire" for 100 nodes, it would wait x minutes and then start if more than 50 nodes are available. If availability grows, say, 30 minutes later, it would upscale, "aspiring" for those 100...

Can something like this be achieved?

Thanks!   
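
To make the requested "aspire" behaviour concrete, here is a minimal simulation of the policy as described in this post. It is a sketch of the desired behaviour only, not an existing Databricks option: `aspired_sizes`, `ideal`, and `floor` are hypothetical names, the 100 and 50 come from the example above, and the "wait x minutes" step is left out for brevity.

```python
from typing import List, Optional


def aspired_sizes(availability: List[int], ideal: int = 100,
                  floor: int = 50) -> List[Optional[int]]:
    """Simulate the 'aspire' policy over successive availability readings.

    Returns the cluster size at each reading, or None while the cluster
    is still waiting to start.
    """
    started = False
    sizes: List[Optional[int]] = []
    for available in availability:
        # Start only once more than `floor` nodes are available.
        if not started and available > floor:
            started = True
        # Once running, never shrink voluntarily: use as many nodes as are
        # available, capped at the ideal size, and do not fail on a drop.
        sizes.append(min(available, ideal) if started else None)
    return sizes


print(aspired_sizes([40, 70, 60, 120]))  # [None, 70, 60, 100]
```

The closest shape mentioned in this thread is Autoscaling with `min_workers` set to 50 and `max_workers` set to 100, but, as noted above, that setup also downsizes voluntarily when the workload is light.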
