Hey @KyraHinnegan,
I did some digging, and here's what I found. Hopefully it clears up what's going on.
At a high level, not every endpoint type exposes infrastructure health metrics via /metrics. What you’re seeing with FOUNDATION_MODEL_API returning a 404 is expected right now.
The /api/2.0/serving-endpoints/[ENDPOINT_NAME]/metrics endpoint is the health metrics exporter. This is where you get latency, request rate, error rate, CPU, memory, GPU — all the signals you’d typically pipe into Prometheus or Datadog.
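For reference, a minimal call to that exporter might look like the sketch below. Only the URL shape comes from the endpoint path above; the helper names and the env-var convention (`DATABRICKS_HOST`, `DATABRICKS_TOKEN`) are my own choices, so adapt as needed:

```python
import urllib.request

def metrics_url(host: str, endpoint_name: str) -> str:
    # URL shape of the health-metrics exporter described above.
    return f"{host}/api/2.0/serving-endpoints/{endpoint_name}/metrics"

def fetch_metrics(host: str, token: str, endpoint_name: str) -> str:
    # Returns the metrics payload on success; raises urllib.error.HTTPError
    # otherwise (e.g. the 404 you're seeing on a FOUNDATION_MODEL_API endpoint).
    req = urllib.request.Request(
        metrics_url(host, endpoint_name),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode()
```

You'd call `fetch_metrics` with your workspace URL and a PAT that can read the endpoint, then hand the payload to whatever scraper you're using.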
That exporter is wired up for what I’d call the “standard” Mosaic AI Model Serving endpoints. Concretely, that includes endpoints where the served entities are:
Where things diverge is with the Databricks-hosted Foundation Model API endpoints — the pay-per-token ones. These show up as FOUNDATION_MODEL_API when you call /serving-endpoints/list.
Those are backed by a separate multi-tenant system, and as of today, they don’t expose infrastructure health metrics through /metrics. So a 404 there isn’t a misconfiguration — it’s just the current behavior. What you’re seeing lines up with that.
There are also a couple of edge cases worth keeping in mind.
First, some newer foundation models (for example, Llama 4 Maverick) don’t have full metrics support yet. Even when they’re provisioned throughput endpoints, the metrics surface can be more limited than what you’d get with a typical custom model.
Second, endpoint state matters. If an endpoint isn’t in a healthy READY state — say it’s still provisioning, failed, or mid-update — /metrics can fail. In practice, you want to treat READY as a prerequisite before even attempting the call.
Given your goal — avoiding /metrics calls where they’re guaranteed to fail — the cleanest approach is to filter up front based on /serving-endpoints/list.
Only attempt /metrics when:
- state.ready is true and the endpoint is not updating or being deleted, and
- every served_entities[*].entity_type is one of:
That gives you a deterministic way to avoid 404s while still collecting metrics everywhere they actually exist.
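Those checks translate into a small pre-filter over the list response. A sketch, assuming the field names from /serving-endpoints/list (endpoint_type, state.ready, state.config_update, config.served_entities); the allowed entity-type set is a placeholder you'd populate from the criteria above:

```python
def should_scrape(endpoint: dict, allowed_entity_types: set) -> bool:
    # Decide, from one entry in the /serving-endpoints/list response,
    # whether a /metrics call can succeed.
    # Pay-per-token Foundation Model API endpoints never expose /metrics.
    if endpoint.get("endpoint_type") == "FOUNDATION_MODEL_API":
        return False
    state = endpoint.get("state", {})
    # The API reports readiness as the string "READY"; accept a bare True
    # too in case you normalize the payload upstream.
    if state.get("ready") not in (True, "READY"):
        return False
    # Skip endpoints that are mid-update or stuck in a failed update.
    if state.get("config_update") in ("IN_PROGRESS", "UPDATE_FAILED"):
        return False
    entities = endpoint.get("config", {}).get("served_entities", [])
    return bool(entities) and all(
        e.get("entity_type") in allowed_entity_types for e in entities
    )
```

Run this over every endpoint in the list response and only call /metrics on the survivors.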
If you need telemetry for the Foundation Model API endpoints — things like token usage — that lives in AI Gateway usage tracking and the system.serving.endpoint_usage and system.serving.served_entities system tables, not /metrics. So it’s a different surface area depending on what you’re measuring.
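For that side of things, a starting-point query against those system tables might look like the sketch below. The join key and column names (served_entity_id, request_time, input_token_count, output_token_count, endpoint_name) are my best understanding of the system table schemas, not something verified against your workspace, so check them before relying on this:

```python
def token_usage_query() -> str:
    # Daily token totals per endpoint from the serving system tables.
    # Column names are assumptions -- verify against the system.serving.* schemas.
    return """
        SELECT se.endpoint_name,
               DATE(u.request_time)      AS day,
               SUM(u.input_token_count)  AS input_tokens,
               SUM(u.output_token_count) AS output_tokens
        FROM system.serving.endpoint_usage AS u
        JOIN system.serving.served_entities AS se
          ON u.served_entity_id = se.served_entity_id
        GROUP BY se.endpoint_name, DATE(u.request_time)
    """
```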
To be clear, nothing is broken here. It’s just two different endpoint architectures with different observability paths, and your filter is the right way to reconcile them.
Hope this helps.
Cheers, Louis