There are multiple possible root causes, so let me walk through them so you can diagnose which applies to your situation.
WHAT HAPPENS DURING SCALE-FROM-ZERO
When an endpoint scales from zero, Databricks must:
1. Acquire compute capacity (CPU or GPU) from the regional pool
2. Build/restore the container with your model's conda environment
3. Start the gunicorn server and boot worker processes
4. Load your model/agent into memory
5. Pass health/readiness checks so the proxy can route traffic
Your logs show gunicorn starting and workers booting, which means you're getting past steps 1-3 but likely failing at step 4 or 5. This is an important clue.
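If you want to confirm whether the endpoint ever leaves the "scaling from zero" phase, a small polling loop against the endpoint state makes the hang visible. This is a sketch with the status lookup injected as a callable — wire it to something like the SDK's `w.serving_endpoints.get(name).state` or your own log check; the names here are illustrative, not a specific Databricks API.

```python
import time

def wait_until_ready(get_state, timeout_s=600, poll_s=5):
    """Poll an endpoint-state callable until it reports READY or time runs out.

    get_state: zero-arg callable returning a status string (e.g. wired to the
    Databricks SDK's endpoint state lookup - illustrative, not a real binding).
    Returns elapsed seconds on success; raises TimeoutError on timeout.
    """
    start = time.monotonic()
    while (elapsed := time.monotonic() - start) < timeout_s:
        if get_state() == "READY":
            return elapsed
        time.sleep(poll_s)
    raise TimeoutError(f"endpoint not ready after {timeout_s}s (last state: {get_state()})")
```

If the helper times out while gunicorn is already up in the service logs, you are looking at step 4 or 5, not a capacity problem.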
ROOT CAUSE #1: MLflow 3.10.0 Gunicorn Deadlock (Most Likely)
I notice your logs show "Starting gunicorn 25.1.0". MLflow 3.10.0 upgraded gunicorn from 23.0.0 to 25.1.0, which introduced an intermittent system call deadlock in the model serving infrastructure. The symptoms match your description exactly:
- Build logs show success (conda environment created)
- Service logs show gunicorn starting and workers booting
- Endpoint never becomes ready
- CPU metrics show the endpoint is active (the container IS running, it's just deadlocked)
- You are billed because compute is provisioned
- All requests time out
This does NOT happen 100% of the time, which explains the "random" nature of your issue.
Workaround: Pin your MLflow version to 3.9.0 or lower when logging your model:
import mlflow

mlflow.pyfunc.log_model(
    artifact_path="model",
    python_model=your_model,
    pip_requirements=[
        "mlflow==3.9.0",
        # ... your other dependencies
    ],
)
After re-logging the model with the pinned version, update the endpoint to use the new model version. The engineering team is actively working on a permanent fix.
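If you log models from shared utility code, it can help to enforce the pin programmatically instead of remembering it per model. A hedged sketch (the helper name is my own, not part of MLflow):

```python
import re

def pin_mlflow(pip_requirements, version="3.9.0"):
    """Return a copy of a pip_requirements list with mlflow pinned.

    Replaces any existing mlflow requirement (pinned or not) with
    mlflow==<version>, and prepends the pin if mlflow was not listed.
    """
    pin = f"mlflow=={version}"
    out = [pin if re.match(r"^mlflow([=<>!~ ]|$)", r.strip()) else r
           for r in pip_requirements]
    if pin not in out:
        out.insert(0, pin)
    return out
```

Pass the result as `pip_requirements` to `mlflow.pyfunc.log_model` as shown above.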
ROOT CAUSE #2: Regional Compute Capacity Exhaustion
Even if your container builds successfully, the endpoint can get stuck if there is insufficient compute capacity in your cloud region. This is especially common with GPU endpoints in popular regions (Azure East US, AWS us-east-1, etc.) during peak hours.
From the Databricks documentation: "Scale to zero is not recommended for production endpoints, as capacity is not guaranteed when scaled to zero."
When you scale to zero, you relinquish your compute capacity. Scaling back up is best-effort, first-come-first-served. If the region is at capacity, your endpoint can be stuck indefinitely.
ROOT CAUSE #3: Model/Agent Initialization Timeout
If your agent has heavy initialization (loading large models, establishing database connections, downloading artifacts), it may exceed the startup timeout. The container startup timeout is typically ~600 seconds. If your model doesn't respond to health checks within that window, the endpoint appears stuck.
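One mitigation is to keep construction cheap and defer heavy work to the first request, so the container can pass readiness checks inside the timeout and the first real request pays the loading cost instead (pair this with client-side retries). A minimal, framework-agnostic sketch of the pattern — the class and names are illustrative, not a Databricks or MLflow API:

```python
class LazyAgent:
    """Defers expensive initialization until the first predict call.

    The constructor does no heavy work, so the serving container can come up
    and pass its health/readiness checks quickly.
    """

    def __init__(self, loader):
        self._loader = loader   # zero-arg callable that does the heavy lifting
        self._model = None

    def predict(self, inputs):
        if self._model is None:  # load once, on first use
            self._model = self._loader()
        return self._model(inputs)
```

If you serve a pyfunc model, the same idea applies: keep `load_context` light and lazy-load the expensive pieces inside `predict`.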
WHAT YOU SHOULD DO
Step 1: Check your MLflow version. Your logs already show gunicorn 25.1.0, which strongly suggests MLflow >= 3.10.0. Pin it to 3.9.0 as described above and re-deploy.
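To decide whether a given model is in the affected range, compare the MLflow version it was logged with against 3.10. A tiny sketch of the comparison (assumes a plain "X.Y.Z" version string):

```python
def affected_by_gunicorn_deadlock(mlflow_version):
    """True if the version is >= 3.10, the release that bumped gunicorn to 25.1.0."""
    major, minor = (int(p) for p in mlflow_version.split(".")[:2])
    return (major, minor) >= (3, 10)
```

You can find the logged version in the model's requirements.txt artifact or via `mlflow.__version__` in the logging environment.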
Step 2: For production endpoints, disable scale-to-zero so the workload size keeps capacity provisioned at all times:

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import EndpointCoreConfigInput, ServedEntityInput

w = WorkspaceClient()
w.serving_endpoints.create(
    name="my-agent-endpoint",
    config=EndpointCoreConfigInput(
        served_entities=[
            ServedEntityInput(
                entity_name="catalog.schema.my_agent",
                entity_version="1",
                workload_size="Small",  # up to 4 concurrent requests stay provisioned
                scale_to_zero_enabled=False,
            )
        ]
    ),
)

(Note: min_provisioned_throughput applies to provisioned-throughput foundation model endpoints, not to workload-size custom model endpoints, so it is omitted here.)
Step 3: If you need scale-to-zero for cost reasons, consider these mitigations:
- Implement retry logic with exponential backoff (first request after idle may 504)
- Use a scheduled "keep-warm" ping during business hours to prevent scaling down
- Clean up old served entities (unused model versions at 0% traffic slow startup)
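For the retry mitigation above, here is a client-side sketch with exponential backoff. The exception type and delays are placeholders — adapt `retry_on` to whatever your HTTP client raises on a 504/timeout:

```python
import time

def call_with_backoff(fn, max_attempts=5, base_delay_s=1.0, retry_on=(TimeoutError,)):
    """Call fn(), retrying on the given exceptions with exponential backoff.

    Delays grow as base_delay_s * 2**attempt (1s, 2s, 4s, ...), which gives a
    scale-from-zero endpoint time to come up before the client gives up.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay_s * 2 ** attempt)
```

Wrap your endpoint invocation in `fn` (e.g. a lambda around your requests/SDK call).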
Step 4: If the issue persists after pinning MLflow and the above steps, open a support ticket with your workspace ID, endpoint name, region, and timestamps of when the issue occurred. The support team can check backend capacity and control-plane issues.
QUICK DIAGNOSIS TABLE
- Build logs succeed, gunicorn starts, but hangs intermittently → MLflow gunicorn deadlock
- Events show "resource did not become available in time" → Regional capacity exhaustion
- Service logs show model loading errors or timeouts → Initialization timeout
- Issue only with GPU endpoints, CPU works → GPU capacity shortage
- Billed but shows "Scaling from zero" → Container running but failing health checks
TL;DR: Your logs showing gunicorn 25.1.0 strongly suggest the MLflow 3.10.0 gunicorn deadlock bug. Pin MLflow to 3.9.0, disable scale-to-zero for production, and if the issue persists, open a support ticket with your workspace ID and timestamps.
Hope this helps!