cluster nodes unavailable scenarios

Nino — Mon, 11 Sep 2023 13:35:42 GMT

Concerning job cluster configuration, I'm trying to figure out what happens if the number of available AWS instances of the chosen node type is smaller than the minimum number of workers specified in the configuration JSON (either availability < num_workers or, for autoscaling, availability < min_workers).
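For concreteness, here is a rough sketch of the two configuration shapes I mean, written as Python dicts that mirror the `new_cluster` JSON. The `spark_version` and `node_type_id` values and the worker counts are placeholders, and `minimum_workers` is just a hypothetical helper to show which number availability gets compared against.

```python
# Two illustrative `new_cluster` specs. All values below are placeholders,
# not recommendations.
fixed_size_cluster = {
    "spark_version": "13.3.x-scala2.12",  # placeholder runtime version
    "node_type_id": "i3.xlarge",          # placeholder AWS node type
    "num_workers": 100,                   # fixed size, no autoscaling
}

autoscaling_cluster = {
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "autoscale": {"min_workers": 50, "max_workers": 100},
}


def minimum_workers(spec: dict) -> int:
    """Hypothetical helper: the smallest worker count a spec asks for."""
    if "autoscale" in spec:
        return spec["autoscale"]["min_workers"]
    return spec["num_workers"]
```

The question is what happens when fewer instances are available than the number this helper returns.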

Seeking insights into both scenarios:

Low availability at cluster start
Availability drop while computation is already in progress

Will the cluster start/continue computation? Wait? Fail?
Are there configuration options that control this behavior?

thanks!

Re: cluster nodes unavailable scenarios

Nino — Tue, 12 Sep 2023 07:11:07 GMT

Thanks, @Retired_mod, useful info!

My specific scenario is running a notebook task on a Job Cluster. I've noticed that I get the best overall notebook run time without autoscaling, by setting a fixed `num_workers` in the cluster configuration. (It is a single notebook where a heavy ETL operation is followed by a lightweight cmd cell, then something heavy again, so with autoscaling the cluster scales up and down a lot.)

So, by your explanation, the num_workers approach puts me at risk in the case of low instance availability. This can be mitigated by Autoscaling, which in turn leads to increased run time.

Is there a way to configure the Job Cluster so that it "aspires" to an ideal size, but doesn't fail if this ideal isn't reached?

This would be similar to autoscaling, except that the cluster would not downsize voluntarily (it would downsize only if reduced availability forces it to, and even then it would not fail immediately). So if configured to "aspire" to 100 nodes, it would wait x minutes and then start if more than 50 nodes are available. If availability grows, say, 30 minutes later, it would scale up, still "aspiring" to those 100.
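To pin down the behavior I have in mind, here is a rough Python sketch of the sizing rule. This is not a Databricks API or an existing feature; the function name, the parameters, and the 10-minute default wait are all made up for illustration.

```python
from typing import Optional


def desired_size(
    available: int,
    waited_minutes: float,
    ideal: int = 100,             # the size the cluster "aspires" to
    floor: int = 50,              # start only with more than this many nodes
    max_wait_minutes: float = 10  # the "x minutes" from the description
) -> Optional[int]:
    """Hypothetical rule: worker count to run with, or None to keep waiting."""
    if available >= ideal:
        # Enough capacity: never ask for more than the ideal size.
        return ideal
    if waited_minutes < max_wait_minutes:
        # Still within the grace period: hold out for the ideal size.
        return None
    if available > floor:
        # Grace period over: settle for whatever is available above the floor.
        return available
    # Too little capacity: keep waiting instead of failing.
    return None


print(desired_size(available=70, waited_minutes=0))    # None (keep waiting)
print(desired_size(available=70, waited_minutes=10))   # 70
print(desired_size(available=120, waited_minutes=30))  # 100
```

Re-evaluating the same rule periodically on a running cluster would give the "scale up when availability grows, never scale down voluntarily" part, since the result depends only on availability and not on load.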

Can something like this be achieved?

Thanks!
