There are multiple possible root causes, so let me walk through them so you can diagnose which applies to your situation.
WHAT HAPPENS DURING SCALE-FROM-ZERO
When an endpoint scales from zero, Databricks must:
1. Acquire compute capacity (CPU or GPU) from the regional pool
2. Build/restore the container with your model's conda environment
3. Start the gunicorn server and boot worker processes
4. Load your model/agent into memory
5. Pass health/readiness checks so the proxy can route traffic
Your logs show gunicorn starting and workers booting, which means you're getting past steps 1-3 but likely failing at step 4 or 5. This is an important clue.
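If you want to confirm whether the endpoint ever leaves the "scaling from zero" phase, a small polling loop against the endpoint state makes the hang visible. This is a sketch with the status lookup injected as a callable — wire it to something like the SDK's `w.serving_endpoints.get(name).state` or your own log check; the names here are illustrative, not a specific Databricks API.

```python
import time

def wait_until_ready(get_state, timeout_s=600, poll_s=5):
    """Poll an endpoint-state callable until it reports READY or time runs out.

    get_state: zero-arg callable returning a status string (e.g. wired to the
    Databricks SDK's endpoint state lookup - illustrative, not a real binding).
    Returns elapsed seconds on success; raises TimeoutError on timeout.
    """
    start = time.monotonic()
    while (elapsed := time.monotonic() - start) < timeout_s:
        if get_state() == "READY":
            return elapsed
        time.sleep(poll_s)
    raise TimeoutError(f"endpoint not ready after {timeout_s}s (last state: {get_state()})")
```

If the helper times out while gunicorn is already up in the service logs, you are looking at step 4 or 5, not a capacity problem.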
ROOT CAUSE #1: MLflow 3.10.0 Gunicorn Deadlock (Most Likely)
I notice your logs show "Starting gunicorn 25.1.0". MLflow 3.10.0 upgraded gunicorn from 23.0.0 to 25.1.0, which introduced an intermittent system call deadlock in the model serving infrastructure. The symptoms match your description exactly:
- Build logs show success (conda environment created)
- Service logs show gunicorn starting and workers booting
- Endpoint never becomes ready
- CPU metrics show the endpoint is active (the container IS running, it's just deadlocked)
- You are billed because compute is provisioned
- All requests time out
This does NOT happen 100% of the time, which explains the "random" nature of your issue.
Workaround: Pin your MLflow version to 3.9.0 or lower when logging your model:
import mlflow

mlflow.pyfunc.log_model(
    artifact_path="model",
    python_model=your_model,
    pip_requirements=[
        "mlflow==3.9.0",
        # ... your other dependencies
    ],
)
After re-logging the model with the pinned version, update the endpoint to use the new model version. The engineering team is actively working on a permanent fix.
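If you log models from shared utility code, it can help to enforce the pin programmatically instead of remembering it per model. A hedged sketch (the helper name is my own, not part of MLflow):

```python
import re

def pin_mlflow(pip_requirements, version="3.9.0"):
    """Return a copy of a pip_requirements list with mlflow pinned.

    Replaces any existing mlflow requirement (pinned or not) with
    mlflow==<version>, and prepends the pin if mlflow was not listed.
    """
    pin = f"mlflow=={version}"
    out = [pin if re.match(r"^mlflow([=<>!~ ]|$)", r.strip()) else r
           for r in pip_requirements]
    if pin not in out:
        out.insert(0, pin)
    return out
```

Pass the result as `pip_requirements` to `mlflow.pyfunc.log_model` as shown above.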
ROOT CAUSE #2: Regional Compute Capacity Exhaustion
Even if your container builds successfully, the endpoint can get stuck if there is insufficient compute capacity in your cloud region. This is especially common with GPU endpoints in popular regions (Azure East US, AWS us-east-1, etc.) during peak hours.
From the Databricks documentation: "Scale to zero is not recommended for production endpoints, as capacity is not guaranteed when scaled to zero."
When you scale to zero, you relinquish your compute capacity. Scaling back up is best-effort, first-come-first-served. If the region is at capacity, your endpoint can be stuck indefinitely.
ROOT CAUSE #3: Model/Agent Initialization Timeout
If your agent has heavy initialization (loading large models, establishing database connections, downloading artifacts), it may exceed the startup timeout. The container startup timeout is typically ~600 seconds. If your model doesn't respond to health checks within that window, the endpoint appears stuck.
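One mitigation is to keep construction cheap and defer heavy work to the first request, so the container can pass readiness checks inside the timeout and the first real request pays the loading cost instead (pair this with client-side retries). A minimal, framework-agnostic sketch of the pattern — the class and names are illustrative, not a Databricks or MLflow API:

```python
class LazyAgent:
    """Defers expensive initialization until the first predict call.

    The constructor does no heavy work, so the serving container can come up
    and pass its health/readiness checks quickly.
    """

    def __init__(self, loader):
        self._loader = loader   # zero-arg callable that does the heavy lifting
        self._model = None

    def predict(self, inputs):
        if self._model is None:  # load once, on first use
            self._model = self._loader()
        return self._model(inputs)
```

If you serve a pyfunc model, the same idea applies: keep `load_context` light and lazy-load the expensive pieces inside `predict`.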
WHAT YOU SHOULD DO
Step 1: Check your MLflow version. Your logs already show gunicorn 25.1.0, which strongly suggests MLflow >= 3.10.0. Pin it to 3.9.0 as described above and re-deploy.
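To decide whether a given model is in the affected range, compare the MLflow version it was logged with against 3.10. A tiny sketch of the comparison (assumes a plain "X.Y.Z" version string):

```python
def affected_by_gunicorn_deadlock(mlflow_version):
    """True if the version is >= 3.10, the release that bumped gunicorn to 25.1.0."""
    major, minor = (int(p) for p in mlflow_version.split(".")[:2])
    return (major, minor) >= (3, 10)
```

You can find the logged version in the model's requirements.txt artifact or via `mlflow.__version__` in the logging environment.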
Step 2: For production endpoints, disable scale-to-zero so the workload size keeps capacity provisioned at all times:

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import EndpointCoreConfigInput, ServedEntityInput

w = WorkspaceClient()
w.serving_endpoints.create(
    name="my-agent-endpoint",
    config=EndpointCoreConfigInput(
        served_entities=[
            ServedEntityInput(
                entity_name="catalog.schema.my_agent",
                entity_version="1",
                workload_size="Small",  # up to 4 concurrent requests stay provisioned
                scale_to_zero_enabled=False,
            )
        ]
    ),
)

(Note: min_provisioned_throughput applies to provisioned-throughput foundation model endpoints, not to workload-size custom model endpoints, so it is omitted here.)
Step 3: If you need scale-to-zero for cost reasons, consider these mitigations:
- Implement retry logic with exponential backoff (first request after idle may 504)
- Use a scheduled "keep-warm" ping during business hours to prevent scaling down
- Clean up old served entities (unused model versions at 0% traffic slow startup)
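For the retry mitigation above, here is a client-side sketch with exponential backoff. The exception type and delays are placeholders — adapt `retry_on` to whatever your HTTP client raises on a 504/timeout:

```python
import time

def call_with_backoff(fn, max_attempts=5, base_delay_s=1.0, retry_on=(TimeoutError,)):
    """Call fn(), retrying on the given exceptions with exponential backoff.

    Delays grow as base_delay_s * 2**attempt (1s, 2s, 4s, ...), which gives a
    scale-from-zero endpoint time to come up before the client gives up.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay_s * 2 ** attempt)
```

Wrap your endpoint invocation in `fn` (e.g. a lambda around your requests/SDK call).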
Step 4: If the issue persists after pinning MLflow and the above steps, open a support ticket with your workspace ID, endpoint name, region, and timestamps of when the issue occurred. The support team can check backend capacity and control-plane issues.
QUICK DIAGNOSIS TABLE
- Build logs succeed, gunicorn starts, but hangs intermittently → MLflow gunicorn deadlock
- Events show "resource did not become available in time" → Regional capacity exhaustion
- Service logs show model loading errors or timeouts → Initialization timeout
- Issue only with GPU endpoints, CPU works → GPU capacity shortage
- Billed but shows "Scaling from zero" → Container running but failing health checks
TL;DR: Your logs showing gunicorn 25.1.0 strongly suggest the MLflow 3.10.0 gunicorn deadlock bug. Pin MLflow to 3.9.0, disable scale-to-zero for production, and if the issue persists, open a support ticket with your workspace ID and timestamps.
Hope this helps!