Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How does job cluster autoscaling work?

aranjan99
Contributor

Can you share the metrics Databricks uses during job cluster autoscaling?
Is Databricks looking at queued tasks, slot utilization, etc., or just at CPU utilization?

The autoscaling document https://docs.databricks.com/aws/en/compute/configure#how-autoscaling-behaves
doesn't provide details on this.

Also, it looks like enhanced autoscaling is not available on job clusters.

2 REPLIES

anshu_roy
Databricks Employee

Hello,
Databricks job cluster autoscaling makes decisions from Spark scheduler signals (pending/queued tasks versus available task slots, plus idleness windows), not from raw CPU utilization alone. Enhanced autoscaling uses task queue size and task slot utilization. Autoscaling is available for jobs compute and can be enabled in the cluster configuration as described in the Databricks docs.

SteveOstrowski
Databricks Employee

Hi @aranjan99,

The autoscaling behavior on job clusters depends on your workspace pricing tier. Here is a breakdown of the metrics and mechanics involved.

WHAT METRICS DRIVE SCALING DECISIONS

Job cluster autoscaling uses Spark scheduler signals, not raw CPU utilization. The primary inputs are:

- Pending/queued tasks: how many tasks are waiting to be scheduled
- Available task slots: how many slots are open across current executors
- Node idleness: whether nodes have active tasks or are sitting idle
- Shuffle file state (Premium tier only): whether in-progress shuffle data is still needed on a node before it can be safely removed

The autoscaler does not rely on CPU percentage. It watches the Spark task scheduler to determine whether there is more work than available capacity (scale up) or excess capacity sitting unused (scale down).
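To make the scheduler-driven logic concrete, here is an illustrative Python sketch of that decision, comparing pending tasks against free task slots. This is a simplified model for intuition only, not Databricks' actual implementation; the function name, parameters, and 40-second threshold default are assumptions for the example.

```python
def scaling_decision(pending_tasks: int, total_slots: int, busy_slots: int,
                     idle_seconds: float, idle_threshold: float = 40.0) -> str:
    """Illustrative sketch of scheduler-signal autoscaling (not Databricks' real code).

    Scale up when more tasks are queued than free slots can absorb;
    scale down when capacity has sat idle past the underutilization window.
    """
    free_slots = total_slots - busy_slots
    if pending_tasks > free_slots:
        return "scale_up"
    if busy_slots == 0 and idle_seconds >= idle_threshold:
        return "scale_down"
    return "hold"

print(scaling_decision(pending_tasks=120, total_slots=64, busy_slots=64, idle_seconds=0))  # more work than capacity
print(scaling_decision(pending_tasks=0, total_slots=64, busy_slots=0, idle_seconds=60))    # idle past the window
```

Note that CPU percentage never appears as an input: a cluster can be at low CPU while tasks queue (scale up), or at moderate CPU with empty queues (hold or scale down).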

OPTIMIZED AUTOSCALING (PREMIUM PLAN AND ABOVE)

If your workspace is on the Premium plan, your job clusters automatically use "optimized autoscaling," which has these characteristics:

Scale-up:
- Scales from min to max in two steps, so new capacity arrives quickly rather than adding nodes one at a time.

Scale-down:
- Can scale down even when the cluster is not fully idle, by analyzing the shuffle file state to determine whether nodes can be safely removed.
- Scales down based on a percentage of current nodes rather than removing one at a time.
- On job compute, the cluster scales down after just 40 seconds of underutilization (compared to 150 seconds for all-purpose compute).

You can tune the downscaling frequency with this Spark config:

spark.databricks.aggressiveWindowDownS = <seconds>

This controls how often the cluster re-evaluates its scale-down decision. The maximum is 600 seconds. A higher value means the cluster holds onto nodes longer, which can help with bursty workloads that have short idle gaps between stages.
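For example, a job cluster spec (the `new_cluster` block passed to the Jobs API) could set the autoscale range and this config together. The `node_type_id` and `spark_version` values below are placeholders; substitute ones valid for your workspace.

```python
# Hypothetical "new_cluster" spec for a Databricks job; node_type_id and
# spark_version are placeholders for your cloud and runtime.
new_cluster = {
    "spark_version": "15.4.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "autoscale": {"min_workers": 2, "max_workers": 16},
    "spark_conf": {
        # Re-evaluate scale-down less often (max 600) so bursty workloads
        # with short idle gaps between stages keep their nodes.
        "spark.databricks.aggressiveWindowDownS": "120",
    },
}
```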

STANDARD AUTOSCALING (STANDARD PLAN)

If your workspace is on the Standard plan, the autoscaling behavior is different:

Scale-up:
- Starts by adding 8 nodes, then scales up exponentially until it reaches the configured max.

Scale-down:
- Only scales down when 90% of nodes are not busy for 10 minutes AND the cluster has been idle for at least 30 seconds.
- Removes nodes exponentially, starting with 1 node.

This is significantly more conservative than optimized autoscaling, especially on the scale-down side.
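One way to read "starts by adding 8 nodes, then scales up exponentially" is a doubling increment capped at the configured max. The sketch below is an interpretation for intuition, not Databricks' documented algorithm.

```python
def standard_scale_up_steps(current: int, target_max: int, first_step: int = 8) -> list:
    """Illustrative reading of Standard-plan scale-up: add 8 nodes first,
    then double the increment each step until max is reached.
    (An interpretation for intuition, not the documented exact behavior.)"""
    sizes = []
    step = first_step
    while current < target_max:
        current = min(current + step, target_max)
        sizes.append(current)
        step *= 2
    return sizes

print(standard_scale_up_steps(current=2, target_max=50))  # → [10, 26, 50]
```

Under this reading, a 2-node cluster with max 50 reaches full size in three steps, which matches the "exponential until max" description.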

ENHANCED AUTOSCALING AND JOB CLUSTERS

You are correct that enhanced autoscaling is not available on regular job clusters. Enhanced autoscaling is a separate feature that applies only to Lakeflow Spark Declarative Pipeline (SDP, previously known as Delta Live Tables) update clusters. It uses task-slot utilization and task-queue depth and can proactively shut down underutilized nodes without causing task failures.

If you need that level of autoscaling intelligence and your workload fits the model, consider running your logic as a Lakeflow Spark Declarative Pipeline.
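If you do go the pipeline route, enhanced autoscaling is selected per pipeline cluster via the autoscale `mode` field in the pipeline settings. A minimal sketch of that cluster block (worker counts are example values):

```python
# Sketch of a Lakeflow/DLT pipeline cluster block; "mode": "ENHANCED"
# selects enhanced autoscaling for the pipeline's update cluster.
pipeline_cluster = {
    "label": "default",
    "autoscale": {
        "min_workers": 1,
        "max_workers": 5,
        "mode": "ENHANCED",  # the alternative is "LEGACY"
    },
}
```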

LIMITATIONS TO BE AWARE OF

- Autoscaling is not available for spark-submit jobs.
- Structured Streaming workloads have limited scale-down capabilities since the system avoids removing nodes that are actively processing micro-batches.
- Clusters cannot scale below the configured min_workers value, and cannot scale down to zero workers.

REFERENCES

Compute autoscaling configuration:
https://docs.databricks.com/en/compute/configure.html

Enhanced autoscaling for SDP pipelines:
https://docs.databricks.com/en/delta-live-tables/auto-scaling.html

* This reply was drafted with an agent system I built, which researches responses against the documentation I have available and previous memory. I personally review each draft for obvious issues, monitor the system's reliability, and update the response when I detect drift, but there is still a small chance something is inaccurate, especially if you are experimenting with brand-new features.

If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.