Hey @KyraHinnegan, I did some digging and here's what I found. Based on the Databricks documentation, the GPU metrics exposed by the Serving Endpoint Metrics API follow a clear, consistent naming convention, so once you know the pattern the response is very predictable and easy to work with.
GPU metric keys
The API exposes two GPU-specific metrics, each broken out per individual GPU on the serving instance.
GPU usage
You’ll see GPU utilization reported using the following key pattern:
- gpu_usage_percentage{gpu="gpu0"}
- gpu_usage_percentage{gpu="gpu1"}
- gpu_usage_percentage{gpu="gpuN"}
Each GPU is tracked independently using the gpu label. Values like gpu0, gpu1, and so on correspond to the physical GPUs attached to the instance.
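If you want to see these keys for your own endpoint, here's a minimal sketch of pulling the raw metrics over HTTP in Python. It assumes the /api/2.0/serving-endpoints/<name>/metrics export route from the Databricks metrics-export docs (double-check the path for your workspace), and the host, token, and endpoint name below are placeholders, not anything from the API itself.

```python
import os
import requests

# Assumptions: DATABRICKS_HOST (e.g. "https://my-workspace.cloud.databricks.com")
# and DATABRICKS_TOKEN are set in the environment; "my-gpu-endpoint" is a placeholder.
HOST = os.environ["DATABRICKS_HOST"]
TOKEN = os.environ["DATABRICKS_TOKEN"]
ENDPOINT_NAME = "my-gpu-endpoint"

# Metrics export route for a serving endpoint (returns Prometheus text format).
url = f"{HOST}/api/2.0/serving-endpoints/{ENDPOINT_NAME}/metrics"
resp = requests.get(url, headers={"Authorization": f"Bearer {TOKEN}"})
resp.raise_for_status()

# Print only the per-GPU utilization samples to see the gpu0..gpuN labels.
for line in resp.text.splitlines():
    if line.startswith("gpu_usage_percentage"):
        print(line)
```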
GPU memory usage
GPU memory utilization follows the same labeling approach:
- gpu_memory_usage_percentage{gpu="gpu0"}
- gpu_memory_usage_percentage{gpu="gpu1"}
- gpu_memory_usage_percentage{gpu="gpuN"}
Again, memory usage is reported per GPU device, making it straightforward to see how evenly (or unevenly) memory pressure is distributed.
Response format
All metrics are returned using the Prometheus / OpenMetrics exposition format. In practice, a response containing GPU metrics will look something like this:
```
# TYPE gpu_usage_percentage gauge
gpu_usage_percentage{gpu="gpu0",endpoint="your-endpoint-name"} 45.2
gpu_usage_percentage{gpu="gpu1",endpoint="your-endpoint-name"} 52.8
# TYPE gpu_memory_usage_percentage gauge
gpu_memory_usage_percentage{gpu="gpu0",endpoint="your-endpoint-name"} 68.5
gpu_memory_usage_percentage{gpu="gpu1",endpoint="your-endpoint-name"} 71.3
```
This structure makes it easy to scrape, aggregate, and visualize the metrics using standard Prometheus tooling.
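If you'd rather work with parsed samples than raw text, the parser that ships with the prometheus_client Python package handles this format directly. The gpu_samples helper below is just an illustration of mine, and metrics_text stands in for the response body fetched in the earlier sketch.

```python
from prometheus_client.parser import text_string_to_metric_families

def gpu_samples(metrics_text: str) -> dict:
    """Return {metric_name: {gpu_label: value}} for the two GPU metrics."""
    wanted = {"gpu_usage_percentage", "gpu_memory_usage_percentage"}
    out = {name: {} for name in wanted}
    for family in text_string_to_metric_families(metrics_text):
        if family.name in wanted:
            for sample in family.samples:
                out[family.name][sample.labels.get("gpu", "")] = sample.value
    return out

# Quick check against the example response above (truncated to one metric):
example = """\
# TYPE gpu_usage_percentage gauge
gpu_usage_percentage{gpu="gpu0",endpoint="your-endpoint-name"} 45.2
gpu_usage_percentage{gpu="gpu1",endpoint="your-endpoint-name"} 52.8
"""
print(gpu_samples(example))
# {'gpu_usage_percentage': {'gpu0': 45.2, 'gpu1': 52.8}, 'gpu_memory_usage_percentage': {}}
```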
Important notes and gotchas
A few practical details are worth keeping in mind:
- These values are averages across all server replicas and are sampled once per minute.
- Because of the relatively low sampling frequency, the metrics are most accurate when the endpoint is under steady, sustained load.

Put simply: the metric schema scales naturally with the hardware you provision, and each GPU shows up as its own labeled time series.
As always, if you’re planning alerts or capacity decisions, it’s worth correlating these metrics with request volume and latency to get the full picture.
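For what it's worth, here's one hedged sketch of how that might look in practice: poll the parsed samples on a schedule, take the worst GPU per poll, and only flag sustained pressure rather than a single one-minute spike. It reuses the gpu_samples helper from the earlier sketch, and the threshold and window are arbitrary placeholders, not Databricks recommendations.

```python
from collections import deque

# Arbitrary placeholder thresholds -- tune against your own request-volume and latency data.
MEMORY_ALERT_THRESHOLD = 85.0   # percent
SUSTAINED_POLLS = 5             # consecutive ~1-minute samples

recent_peaks = deque(maxlen=SUSTAINED_POLLS)

def gpu_memory_pressure_sustained(samples: dict) -> bool:
    """samples: output of gpu_samples() above. True once the hottest GPU has
    stayed above the threshold for SUSTAINED_POLLS consecutive polls."""
    per_gpu = samples.get("gpu_memory_usage_percentage", {})
    if not per_gpu:
        return False
    recent_peaks.append(max(per_gpu.values()))
    return (
        len(recent_peaks) == SUSTAINED_POLLS
        and all(peak >= MEMORY_ALERT_THRESHOLD for peak in recent_peaks)
    )
```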
Hope this helps, Louis.