I’m currently working with Databricks autoscaling and trying to better understand how the platform decides when to spin up additional worker nodes. My cluster is configured with a minimum of one worker and a maximum of five. I understand that tasks are assigned to cores, and that if more tasks are queued than there are available cores, Databricks may consider adding a new worker (assuming autoscaling is enabled). But what specific conditions or metrics trigger the autoscaling event?
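For context, a cluster with these bounds would be defined roughly like this via the Databricks Clusters API (the `autoscale`, `min_workers`, and `max_workers` fields are the relevant ones; the other values here are just placeholders from my setup):

```json
{
  "cluster_name": "autoscale-test",
  "spark_version": "13.3.x-scala2.12",
  "node_type_id": "i3.xlarge",
  "autoscale": {
    "min_workers": 1,
    "max_workers": 5
  }
}
```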
For example:
- Is it based solely on the number of pending tasks in the scheduler queue?
- Does it consider task completion times, memory usage, or CPU utilization on existing workers?
- How quickly does autoscaling react once these conditions are met?
A practical scenario: if I have a single worker with 8 cores and more queued tasks than cores for a prolonged period, will Databricks add another worker immediately, or does it wait for some threshold of sustained load?
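To make the question concrete, here is a small pure-Python sketch of the kind of policy I’m asking about. Every name and threshold below is hypothetical (Databricks does not document its autoscaler in these terms); it only contrasts “scale up the moment a backlog appears” with “scale up only after the backlog is sustained”:

```python
# Hypothetical sketch of a backlog-based scale-up decision.
# None of these names or thresholds come from Databricks; they
# illustrate the "sustained load" question, not the real algorithm.

def should_scale_up(pending_tasks: int,
                    total_cores: int,
                    backlog_seconds: float,
                    sustain_threshold_seconds: float = 30.0) -> bool:
    """Return True if this toy autoscaler would add a worker.

    pending_tasks            -- tasks waiting in the scheduler queue
    total_cores              -- cores across all current workers
    backlog_seconds          -- how long the queue has exceeded core count
    sustain_threshold_seconds -- assumed minimum sustained-backlog time
    """
    backlog_exists = pending_tasks > total_cores
    sustained = backlog_seconds >= sustain_threshold_seconds
    return backlog_exists and sustained

# Scenario from the question: 1 worker x 8 cores, 64 queued tasks.
print(should_scale_up(64, 8, 5.0))   # backlog exists but not yet sustained -> False
print(should_scale_up(64, 8, 45.0))  # sustained backlog -> True
```

Essentially I want to know which of these two branches (or what other signal entirely) the real autoscaler uses, and with what effective threshold.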
I’d appreciate insights from anyone who has worked with Databricks autoscaling in production. References to official documentation, or real-world examples of the conditions that must be met before a new worker is allocated, would be very helpful.