Technical Blog
Explore in-depth articles, tutorials, and insights on data analytics and machine learning in the Databricks Technical Blog. Stay updated on industry trends, best practices, and advanced techniques.
bennovak
Databricks Employee

In a recent engagement, I partnered with a customer who had successfully productionalized sophisticated AI use cases using Genie Spaces, Multi-Agent Supervisor systems, and managed MCP servers. Their innovation velocity was impressive.

As part of a workspace review, we identified more than $150,000 in annualized cost savings achievable through improved Model Serving hygiene.

 

Elevating the User Experience: Making Optimization Visible

While Scale-to-Zero guidance is clearly documented (Section 2.vii), identifying steady-state serving patterns directly through the UI can be less intuitive at scale. By leveraging system-level telemetry, we were able to bring clarity to usage behavior that wasn’t immediately obvious in the Serving tab.

In this workspace, several Model Serving Endpoints displayed as “Not Ready” in the UI, yet underlying compute remained active in a consistent pattern. Optimizing these endpoints resulted in approximately $9,000 per month in cost reduction, without impacting production workloads.

 

Defining “Steady-State Usage” for Optimization

“Steady-state usage” can be understood as an endpoint whose consumption looks perfectly flat over time. If compute consumption doesn’t move at all, down to the decimal, for four straight weeks, that’s usually a sign the workload isn’t behaving dynamically. It doesn’t mean something’s wrong, but it does mean it’s worth a quick optimization check.

For analytical purposes, we defined a measurable signal of optimization opportunity:

Four consecutive weeks of identical usage (to the second decimal place).

This pattern can reasonably occur in three scenarios:

  1. Compute remains provisioned but receives no requests.
  2. Compute receives identical request volumes weekly with no variation in inputs.
  3. Scale-to-Zero endpoints receive highly regular synthetic or repeated requests (e.g., every five minutes) that prevent scaling down.
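The four-identical-weeks signal can be expressed as a small check. Below is a minimal sketch, assuming weekly DBU totals are already available as a list of floats (the function name and parameters are illustrative, not part of any Databricks API):

```python
def has_steady_state_signal(weekly_dbus, weeks=4, decimals=2):
    """Return True if any `weeks` consecutive weekly DBU totals are
    identical after rounding to `decimals` places."""
    rounded = [round(v, decimals) for v in weekly_dbus]
    return any(
        len(set(rounded[i:i + weeks])) == 1
        for i in range(len(rounded) - weeks + 1)
    )
```

For example, `has_steady_state_signal([12.34, 12.34, 12.34, 12.34])` returns `True`, while `[12.34, 12.31, 12.34, 12.35]` does not trigger the signal because the weekly totals vary.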

It’s important to note:

Not all steady-state usage is inappropriate.

Some production workloads are intentionally configured to avoid scale latency. The goal is not automatic pausing; it’s visibility and intentionality.

Optimization is about ensuring resources reflect business intent.

 

Identification Methods

For Databricks Teams (Account-Level Visibility)

An internal Logfood dashboard, “Steady-State Model Serving Endpoint Usage,” enables filtering by Account to review Model Serving consumption across Workspaces.

Optimization Signal:
Identical weekly DBU usage across multiple consecutive weeks.

[Screenshot: dashboard showing identical weekly DBU usage across consecutive weeks]

Expected Healthy Pattern:
Natural variation in weekly usage reflecting real demand.

[Screenshots: dashboard showing natural week-to-week variation in usage]

For Workspace Admins

Workspace Admins can independently perform endpoint-level analysis using System Tables:

SELECT
  u.workspace_id,
  custom_tags.EndpointId,
  usage_metadata.endpoint_name,
  u.billing_origin_product,
  date_format(date_trunc('WEEK', u.usage_start_time), 'yyyy-MM-dd') AS usage_week,
  SUM(u.usage_quantity) AS usage_quantity
FROM system.billing.usage u
WHERE u.usage_date BETWEEN dateadd(DAY, -365, current_date()) AND current_date()
  AND u.billing_origin_product IN ('MODEL_SERVING', 'VECTOR_SEARCH')
GROUP BY 1, 2, 3, 4, 5
ORDER BY usage_metadata.endpoint_name DESC, usage_week DESC
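Once the query results are exported, flagging candidate endpoints is a short post-processing pass. A hedged sketch, assuming rows arrive as `(endpoint_name, usage_week, usage_quantity)` tuples matching the query's columns (the function name and the "most recent four weeks" flagging rule are assumptions for illustration):

```python
from collections import defaultdict

def flag_steady_state_endpoints(rows, weeks=4, decimals=2):
    """Group (endpoint_name, usage_week, usage_quantity) rows by endpoint
    and flag endpoints whose most recent `weeks` weekly totals are
    identical after rounding to `decimals` places."""
    by_endpoint = defaultdict(dict)
    for endpoint, week, qty in rows:
        # Sum quantities in case an endpoint has multiple rows per week.
        by_endpoint[endpoint][week] = by_endpoint[endpoint].get(week, 0.0) + qty
    flagged = []
    for endpoint, weekly in by_endpoint.items():
        recent = [weekly[w] for w in sorted(weekly)[-weeks:]]
        if len(recent) == weeks and len({round(q, decimals) for q in recent}) == 1:
            flagged.append(endpoint)
    return flagged
```

An endpoint with four identical recent weekly totals is flagged for review; endpoints with natural week-to-week variance pass through untouched.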

Optimization Signal:
Consistent, identical DBU consumption across multiple weeks for a given endpoint.

[Screenshot: query results showing identical weekly DBU consumption for an endpoint]

Expected Behavior:
Natural variance in week-to-week usage, ranging from small to large.

[Screenshot: query results showing natural week-to-week variance in usage]

 

Optimization Workflow

The dashboard and query provide the observability layer. From there:

  1. Review usage patterns.
  2. Confirm business intent with endpoint owners.
  3. If appropriate, pause or enable Scale-to-Zero.
  4. Reallocate savings to higher-value workloads.

The key step is alignment, not automation. Some endpoints are intentionally persistent. Others represent experimentation or retired use cases that can now be right-sized.

 

The Bigger Picture: Operational Maturity in AI Workloads

Even highly sophisticated teams benefit from structured workspace hygiene reviews. As AI innovation accelerates, experimentation naturally expands the compute footprint.

By establishing clear signals for steady-state usage and leveraging System Tables for observability, organizations can:

  • Increase cost efficiency
  • Improve resource allocation
  • Reinforce governance best practices
  • Maintain performance where it matters most

What began as a simple review resulted in six-figure annual savings, not through reduction, but through refinement.