Hi,
Running on Databricks on AWS, I have a job that runs on an autoscaling cluster with up to 3 workers, 2 cores each (r6i.large).
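For reference, the cluster is configured roughly like this (a minimal sketch of the relevant part of the cluster spec in Clusters API form; the min_workers value and the spark_version are assumptions for illustration, not copied from my actual config):

```python
# Rough sketch of the autoscaling-related part of the cluster spec.
# min_workers=1 and spark_version are illustrative assumptions.
cluster_spec = {
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "r6i.large",   # 2 cores, 16 GB RAM per worker
    "autoscale": {
        "min_workers": 1,
        "max_workers": 3,
    },
}
```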
The Spark job has two stages (a rough sketch follows the list):
(1) A highly parallelizable, CPU-intensive stage. This stage takes 15 minutes.
(2) A non-parallelizable stage (only a single partition, so a single Spark task). This stage takes 45 minutes.
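The job has roughly this shape (a minimal PySpark sketch, not my actual code; the synthetic data, the function name, and the repartition calls are illustrative assumptions used to force the two stage shapes described above):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

# Stage 1: highly parallelizable, CPU-intensive work spread across many tasks.
# `expensive_fn` stands in for the real per-row computation.
def expensive_fn(x):
    return float(x) ** 0.5

expensive_udf = F.udf(expensive_fn, DoubleType())

wide = (
    spark.range(0, 10_000_000)
         .repartition(64)                           # many tasks -> all 6 cores stay busy
         .withColumn("result", expensive_udf("id"))
)

# Stage 2: non-parallelizable step, forced onto a single partition,
# so it runs as one task on a single core of a single worker.
narrow = wide.repartition(1)
narrow.write.mode("overwrite").parquet("/tmp/output")   # placeholder output path
```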
In the first stage, the cluster scales up from 1 worker to 3, and all 3 workers (6 cores) are fully utilized for the duration of the stage (15 minutes). Then, in the second stage, only a single worker node is active for the entire 45 minutes, but Databricks does not scale the cluster down, so I have two worker nodes sitting completely idle for 45 minutes.
Any idea why that is, and how I can use autoscaling to make this type of job more cost-efficient?
Thanks!