<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Measuring latency metrics like TTFT, TBT when deploying agents on Databricks in Generative AI</title>
    <link>https://community.databricks.com/t5/generative-ai/measuring-latency-metrics-like-ttft-tbt-when-deploying-agents-on/m-p/138338#M1367</link>
    <description>&lt;P&gt;Thanks for your response Louis. If I understand it correctly, for production monitoring, we would have to rely on client side logging. Can mlflow.log_metric be integrated with traces by any chance? (Since that seems to be the only way to measure TTFT on agent/server side)&lt;/P&gt;</description>
    <pubDate>Mon, 10 Nov 2025 06:15:20 GMT</pubDate>
    <dc:creator>Rajat-TVSM</dc:creator>
    <dc:date>2025-11-10T06:15:20Z</dc:date>
    <item>
      <title>Measuring latency metrics like TTFT, TBT when deploying agents on Databricks</title>
      <link>https://community.databricks.com/t5/generative-ai/measuring-latency-metrics-like-ttft-tbt-when-deploying-agents-on/m-p/138208#M1357</link>
      <description>&lt;P&gt;Is there an inbuilt method to measure latency metrics like TTFT, TBT when deploying agents on Databricks? Using MLFlow ChatAgent, ChatDatabricks/OpenAI client(workspace client)&lt;/P&gt;&lt;P&gt;What would be the way to measure them in case no inbuilt method exists?&lt;/P&gt;</description>
      <pubDate>Sat, 08 Nov 2025 09:06:38 GMT</pubDate>
      <guid>https://community.databricks.com/t5/generative-ai/measuring-latency-metrics-like-ttft-tbt-when-deploying-agents-on/m-p/138208#M1357</guid>
      <dc:creator>Rajat-TVSM</dc:creator>
      <dc:date>2025-11-08T09:06:38Z</dc:date>
    </item>
    <item>
      <title>Re: Measuring latency metrics like TTFT, TBT when deploying agents on Databricks</title>
      <link>https://community.databricks.com/t5/generative-ai/measuring-latency-metrics-like-ttft-tbt-when-deploying-agents-on/m-p/138228#M1361</link>
      <description>&lt;P&gt;Greetings&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/160119"&gt;@Rajat-TVSM&lt;/a&gt;&amp;nbsp;, great question. I did some research and I am h&lt;SPAN&gt;appy to help you measure streaming latency for agents on Databricks.&lt;/SPAN&gt;&lt;/P&gt;
&lt;DIV class="paragraph"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;H3 class="_7uu25p0 qt3gz9c _7pq7t612 heading3 _7uu25p1"&gt;What’s available out of the box&lt;/H3&gt;
&lt;UL class="qt3gz97 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;&lt;STRONG&gt;MLflow Tracing&lt;/STRONG&gt; records inputs/outputs, spans, and operational metrics such as latency, and integrates with production monitoring; it also tracks token usage returned by LLM provider APIs. This gives you end‑to‑end and per‑step latency, plus token counts/costs, but not named TTFT/TBT out of the box.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;&lt;STRONG&gt;Streaming query support&lt;/STRONG&gt; via the Databricks OpenAI client on deployed agents lets you consume output chunks as they arrive with &lt;CODE class="qt3gz9f"&gt;stream=True&lt;/CODE&gt;, which you can instrument to compute TTFT (time to first token) and per‑chunk inter‑arrival times (average time between tokens). Legacy &lt;CODE class="qt3gz9f"&gt;chat.completions&lt;/CODE&gt; is still supported but not recommended for new agents; the &lt;CODE class="qt3gz9f"&gt;responses&lt;/CODE&gt; API is preferred for new agents and supports streaming similarly.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;&lt;STRONG&gt;Agent Evaluation (MLflow 2)&lt;/STRONG&gt; aggregates latency and cost metrics over an evaluation set (e.g., average latency), and you can add custom metrics; for TTFT/TBT specifically, you’d implement custom code to compute and log them from traces or client timings.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
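&lt;P class="qt3gz91 paragraph"&gt;Because TTFT/TBT are not named metrics out of the box, it helps to pin down the arithmetic once. A minimal sketch (helper and field names are illustrative, not a Databricks API) that turns a request-start timestamp plus per-chunk arrival timestamps into the three numbers discussed in this thread:&lt;/P&gt;

```python
# Hypothetical helper: derive TTFT / avg TBT / stream time from chunk timestamps.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StreamLatency:
    ttft_ms: float                # request start to first chunk
    avg_tbt_ms: Optional[float]   # mean gap between consecutive chunks (None if one chunk)
    stream_time_s: float          # first chunk to last chunk


def summarize_stream(t0: float, arrivals: List[float]) -> StreamLatency:
    """t0 is the request-start timestamp; arrivals are per-chunk timestamps (seconds)."""
    if not arrivals:
        raise ValueError("no chunks received")
    gaps = [b - a for a, b in zip(arrivals, arrivals[1:])]
    avg_tbt = (sum(gaps) / len(gaps)) * 1000 if gaps else None
    return StreamLatency(
        ttft_ms=(arrivals[0] - t0) * 1000,
        avg_tbt_ms=avg_tbt,
        stream_time_s=arrivals[-1] - arrivals[0],
    )
```

&lt;P class="qt3gz91 paragraph"&gt;The same helper works for any streaming client loop that records &lt;CODE class="qt3gz9f"&gt;time.perf_counter()&lt;/CODE&gt; per chunk.&lt;/P&gt;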
&lt;H3 class="_7uu25p0 qt3gz9c _7pq7t612 heading3 _7uu25p1"&gt;Practical ways to capture TTFT and TBT&lt;/H3&gt;
&lt;P class="qt3gz91 paragraph"&gt;If you need TTFT/TBT specifically, instrument the streaming loop on the client and optionally log metrics to MLflow. Below are idiomatic examples for both the recommended &lt;CODE class="qt3gz9f"&gt;responses&lt;/CODE&gt; client and the legacy &lt;CODE class="qt3gz9f"&gt;chat.completions&lt;/CODE&gt;.&lt;/P&gt;
&lt;H4 class="_7uu25p0 qt3gz9c _7pq7t612 heading4 _7uu25p1"&gt;Databricks OpenAI client (responses) — compute TTFT, avg TBT, throughput&lt;/H4&gt;
&lt;DIV class="go8b9g1 _7pq7t6cj" data-ui-element="code-block-container"&gt;
&lt;PRE&gt;&lt;CODE class="markdown-code-python qt3gz9e hljs language-python _1ymogdh2"&gt;&lt;SPAN class="hljs-keyword"&gt;import&lt;/SPAN&gt; time
&lt;SPAN class="hljs-keyword"&gt;import&lt;/SPAN&gt; numpy &lt;SPAN class="hljs-keyword"&gt;as&lt;/SPAN&gt; np
&lt;SPAN class="hljs-keyword"&gt;from&lt;/SPAN&gt; openai &lt;SPAN class="hljs-keyword"&gt;import&lt;/SPAN&gt; OpenAI

client = OpenAI()  &lt;SPAN class="hljs-comment"&gt;# set api_key and base_url="https://&amp;lt;workspace-host&amp;gt;/serving-endpoints" for Databricks&lt;/SPAN&gt;
endpoint = &lt;SPAN class="hljs-string"&gt;"&amp;lt;your-agent-endpoint&amp;gt;"&lt;/SPAN&gt;
input_msgs = [{&lt;SPAN class="hljs-string"&gt;"role"&lt;/SPAN&gt;: &lt;SPAN class="hljs-string"&gt;"user"&lt;/SPAN&gt;, &lt;SPAN class="hljs-string"&gt;"content"&lt;/SPAN&gt;: &lt;SPAN class="hljs-string"&gt;"Explain MLflow Tracing"&lt;/SPAN&gt;}]

t0 = time.perf_counter()
ttft = &lt;SPAN class="hljs-literal"&gt;None&lt;/SPAN&gt;
arrival_times = []
last_text_time = &lt;SPAN class="hljs-literal"&gt;None&lt;/SPAN&gt;

stream = client.responses.create(model=endpoint, &lt;SPAN class="hljs-built_in"&gt;input&lt;/SPAN&gt;=input_msgs, stream=&lt;SPAN class="hljs-literal"&gt;True&lt;/SPAN&gt;)
&lt;SPAN class="hljs-keyword"&gt;for&lt;/SPAN&gt; event &lt;SPAN class="hljs-keyword"&gt;in&lt;/SPAN&gt; stream:
    now = time.perf_counter()
    &lt;SPAN class="hljs-comment"&gt;# First delta -&amp;gt; TTFT&lt;/SPAN&gt;
    &lt;SPAN class="hljs-keyword"&gt;if&lt;/SPAN&gt; ttft &lt;SPAN class="hljs-keyword"&gt;is&lt;/SPAN&gt; &lt;SPAN class="hljs-literal"&gt;None&lt;/SPAN&gt;:
        ttft = (now - t0) * &lt;SPAN class="hljs-number"&gt;1000&lt;/SPAN&gt;  &lt;SPAN class="hljs-comment"&gt;# ms&lt;/SPAN&gt;
    arrival_times.append(now)

&lt;SPAN class="hljs-comment"&gt;# Compute metrics&lt;/SPAN&gt;
avg_tbt_ms = (np.diff(arrival_times).mean() * &lt;SPAN class="hljs-number"&gt;1000&lt;/SPAN&gt;) &lt;SPAN class="hljs-keyword"&gt;if&lt;/SPAN&gt; &lt;SPAN class="hljs-built_in"&gt;len&lt;/SPAN&gt;(arrival_times) &amp;gt; &lt;SPAN class="hljs-number"&gt;1&lt;/SPAN&gt; &lt;SPAN class="hljs-keyword"&gt;else&lt;/SPAN&gt; &lt;SPAN class="hljs-literal"&gt;None&lt;/SPAN&gt;
stream_time_s = (arrival_times[-&lt;SPAN class="hljs-number"&gt;1&lt;/SPAN&gt;] - arrival_times[&lt;SPAN class="hljs-number"&gt;0&lt;/SPAN&gt;]) &lt;SPAN class="hljs-keyword"&gt;if&lt;/SPAN&gt; arrival_times &lt;SPAN class="hljs-keyword"&gt;else&lt;/SPAN&gt; &lt;SPAN class="hljs-literal"&gt;None&lt;/SPAN&gt;

&lt;SPAN class="hljs-built_in"&gt;print&lt;/SPAN&gt;({&lt;SPAN class="hljs-string"&gt;"ttft_ms"&lt;/SPAN&gt;: ttft, &lt;SPAN class="hljs-string"&gt;"avg_tbt_ms"&lt;/SPAN&gt;: avg_tbt_ms, &lt;SPAN class="hljs-string"&gt;"stream_time_s"&lt;/SPAN&gt;: stream_time_s})&lt;/CODE&gt;&lt;/PRE&gt;
&lt;DIV class="go8b9g3 _7pq7t62w _7pq7t6ck _7pq7t6aw _7pq7t6bm"&gt;
&lt;DIV class="go8b9g5 _7pq7t6ch"&gt;python&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;P class="qt3gz91 paragraph"&gt;Then combine with token usage from traces (if enabled) to compute tokens/sec:&lt;/P&gt;
&lt;DIV class="go8b9g1 _7pq7t6cj" data-ui-element="code-block-container"&gt;
&lt;PRE&gt;&lt;CODE class="markdown-code-python qt3gz9e hljs language-python _1ymogdh2"&gt;&lt;SPAN class="hljs-comment"&gt;# Example if you have access to the MLflow trace object for this request:&lt;/SPAN&gt;
&lt;SPAN class="hljs-comment"&gt;# tokens/sec = output_tokens / (arrival_times[-1] - arrival_times[0])&lt;/SPAN&gt;
tps = trace.info.token_usage.output_tokens / (arrival_times[-&lt;SPAN class="hljs-number"&gt;1&lt;/SPAN&gt;] - arrival_times[&lt;SPAN class="hljs-number"&gt;0&lt;/SPAN&gt;])&lt;/CODE&gt;&lt;/PRE&gt;
&lt;DIV class="go8b9g3 _7pq7t62w _7pq7t6ck _7pq7t6aw _7pq7t6bm"&gt;
&lt;DIV class="go8b9g5 _7pq7t6ch"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;DIV class="go8b9g5 _7pq7t6ch"&gt;python&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;P class="qt3gz91 paragraph"&gt;This leverages streaming with &lt;CODE class="qt3gz9f"&gt;responses&lt;/CODE&gt; and MLflow Tracing’s token usage fields.&lt;/P&gt;
&lt;H4 class="_7uu25p0 qt3gz9c _7pq7t612 heading4 _7uu25p1"&gt;Databricks OpenAI client (legacy chat.completions) — compute TTFT and avg TBT&lt;/H4&gt;
&lt;DIV class="go8b9g1 _7pq7t6cj" data-ui-element="code-block-container"&gt;
&lt;PRE&gt;&lt;CODE class="markdown-code-python qt3gz9e hljs language-python _1ymogdh2"&gt;&lt;SPAN class="hljs-keyword"&gt;import&lt;/SPAN&gt; time
&lt;SPAN class="hljs-keyword"&gt;import&lt;/SPAN&gt; numpy &lt;SPAN class="hljs-keyword"&gt;as&lt;/SPAN&gt; np
&lt;SPAN class="hljs-keyword"&gt;from&lt;/SPAN&gt; openai &lt;SPAN class="hljs-keyword"&gt;import&lt;/SPAN&gt; OpenAI

client = OpenAI()
endpoint = &lt;SPAN class="hljs-string"&gt;"&amp;lt;your-agent-endpoint&amp;gt;"&lt;/SPAN&gt;
messages = [{&lt;SPAN class="hljs-string"&gt;"role"&lt;/SPAN&gt;: &lt;SPAN class="hljs-string"&gt;"user"&lt;/SPAN&gt;, &lt;SPAN class="hljs-string"&gt;"content"&lt;/SPAN&gt;: &lt;SPAN class="hljs-string"&gt;"Explain MLflow Tracing"&lt;/SPAN&gt;}]

t0 = time.perf_counter()
ttft = &lt;SPAN class="hljs-literal"&gt;None&lt;/SPAN&gt;
arrival_times = []

stream = client.chat.completions.create(model=endpoint, messages=messages, stream=&lt;SPAN class="hljs-literal"&gt;True&lt;/SPAN&gt;)
&lt;SPAN class="hljs-keyword"&gt;for&lt;/SPAN&gt; chunk &lt;SPAN class="hljs-keyword"&gt;in&lt;/SPAN&gt; stream:
    now = time.perf_counter()
    &lt;SPAN class="hljs-keyword"&gt;if&lt;/SPAN&gt; ttft &lt;SPAN class="hljs-keyword"&gt;is&lt;/SPAN&gt; &lt;SPAN class="hljs-literal"&gt;None&lt;/SPAN&gt;:
        ttft = (now - t0) * &lt;SPAN class="hljs-number"&gt;1000&lt;/SPAN&gt;  &lt;SPAN class="hljs-comment"&gt;# ms&lt;/SPAN&gt;
    arrival_times.append(now)

avg_tbt_ms = (np.diff(arrival_times).mean() * &lt;SPAN class="hljs-number"&gt;1000&lt;/SPAN&gt;) &lt;SPAN class="hljs-keyword"&gt;if&lt;/SPAN&gt; &lt;SPAN class="hljs-built_in"&gt;len&lt;/SPAN&gt;(arrival_times) &amp;gt; &lt;SPAN class="hljs-number"&gt;1&lt;/SPAN&gt; &lt;SPAN class="hljs-keyword"&gt;else&lt;/SPAN&gt; &lt;SPAN class="hljs-literal"&gt;None&lt;/SPAN&gt;
&lt;SPAN class="hljs-built_in"&gt;print&lt;/SPAN&gt;({&lt;SPAN class="hljs-string"&gt;"ttft_ms"&lt;/SPAN&gt;: ttft, &lt;SPAN class="hljs-string"&gt;"avg_tbt_ms"&lt;/SPAN&gt;: avg_tbt_ms})&lt;/CODE&gt;&lt;/PRE&gt;
&lt;DIV class="go8b9g3 _7pq7t62w _7pq7t6ck _7pq7t6aw _7pq7t6bm"&gt;
&lt;DIV class="go8b9g5 _7pq7t6ch"&gt;python&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;P class="qt3gz91 paragraph"&gt;Use this for existing &lt;CODE class="qt3gz9f"&gt;chat.completions&lt;/CODE&gt; integrations; new agents should prefer &lt;CODE class="qt3gz9f"&gt;responses&lt;/CODE&gt;.&lt;/P&gt;
&lt;H4 class="_7uu25p0 qt3gz9c _7pq7t612 heading4 _7uu25p1"&gt;Logging TTFT/TBT within a ResponsesAgent implementation&lt;/H4&gt;
&lt;P class="qt3gz91 paragraph"&gt;If you author agents in code, you can instrument your &lt;CODE class="qt3gz9f"&gt;predict_stream&lt;/CODE&gt; to compute and log TTFT/TBT and let &lt;STRONG&gt;MLflow Tracing&lt;/STRONG&gt; aggregate the full output for you.&lt;/P&gt;
&lt;DIV class="go8b9g1 _7pq7t6cj" data-ui-element="code-block-container"&gt;
&lt;PRE&gt;&lt;CODE class="markdown-code-python qt3gz9e hljs language-python _1ymogdh2"&gt;&lt;SPAN class="hljs-keyword"&gt;import&lt;/SPAN&gt; time
&lt;SPAN class="hljs-keyword"&gt;import&lt;/SPAN&gt; mlflow
&lt;SPAN class="hljs-keyword"&gt;from&lt;/SPAN&gt; databricks.agents &lt;SPAN class="hljs-keyword"&gt;import&lt;/SPAN&gt; ResponsesAgent, ResponsesAgentStreamEvent

&lt;SPAN class="hljs-keyword"&gt;class&lt;/SPAN&gt; &lt;SPAN class="hljs-title class_"&gt;MyAgent&lt;/SPAN&gt;(&lt;SPAN class="hljs-title class_ inherited__"&gt;ResponsesAgent&lt;/SPAN&gt;):
    &lt;SPAN class="hljs-keyword"&gt;def&lt;/SPAN&gt; &lt;SPAN class="hljs-title function_"&gt;predict_stream&lt;/SPAN&gt;(&lt;SPAN class="hljs-params"&gt;self, request&lt;/SPAN&gt;):
        t0 = time.perf_counter()
        first_time = &lt;SPAN class="hljs-literal"&gt;None&lt;/SPAN&gt;
        arrival_times = []

        &lt;SPAN class="hljs-keyword"&gt;for&lt;/SPAN&gt; stream_event &lt;SPAN class="hljs-keyword"&gt;in&lt;/SPAN&gt; &lt;SPAN class="hljs-variable language_"&gt;self&lt;/SPAN&gt;.agent.stream(request.&lt;SPAN class="hljs-built_in"&gt;input&lt;/SPAN&gt;):  &lt;SPAN class="hljs-comment"&gt;# stream chunks from your underlying LLM/toolchain&lt;/SPAN&gt;
            now = time.perf_counter()
            &lt;SPAN class="hljs-keyword"&gt;if&lt;/SPAN&gt; first_time &lt;SPAN class="hljs-keyword"&gt;is&lt;/SPAN&gt; &lt;SPAN class="hljs-literal"&gt;None&lt;/SPAN&gt;:
                first_time = now
                mlflow.log_metric(&lt;SPAN class="hljs-string"&gt;"ttft_ms"&lt;/SPAN&gt;, (first_time - t0) * &lt;SPAN class="hljs-number"&gt;1000&lt;/SPAN&gt;)

            arrival_times.append(now)
            &lt;SPAN class="hljs-keyword"&gt;yield&lt;/SPAN&gt; stream_event  &lt;SPAN class="hljs-comment"&gt;# forward the delta events&lt;/SPAN&gt;

        &lt;SPAN class="hljs-keyword"&gt;if&lt;/SPAN&gt; &lt;SPAN class="hljs-built_in"&gt;len&lt;/SPAN&gt;(arrival_times) &amp;gt; &lt;SPAN class="hljs-number"&gt;1&lt;/SPAN&gt;:
            avg_tbt_ms = (&lt;SPAN class="hljs-built_in"&gt;sum&lt;/SPAN&gt;(arrival_times[i] - arrival_times[i-&lt;SPAN class="hljs-number"&gt;1&lt;/SPAN&gt;] &lt;SPAN class="hljs-keyword"&gt;for&lt;/SPAN&gt; i &lt;SPAN class="hljs-keyword"&gt;in&lt;/SPAN&gt; &lt;SPAN class="hljs-built_in"&gt;range&lt;/SPAN&gt;(&lt;SPAN class="hljs-number"&gt;1&lt;/SPAN&gt;, &lt;SPAN class="hljs-built_in"&gt;len&lt;/SPAN&gt;(arrival_times))) / (&lt;SPAN class="hljs-built_in"&gt;len&lt;/SPAN&gt;(arrival_times)-&lt;SPAN class="hljs-number"&gt;1&lt;/SPAN&gt;)) * &lt;SPAN class="hljs-number"&gt;1000&lt;/SPAN&gt;
            mlflow.log_metric(&lt;SPAN class="hljs-string"&gt;"avg_tbt_ms"&lt;/SPAN&gt;, avg_tbt_ms)&lt;/CODE&gt;&lt;/PRE&gt;
&lt;DIV class="go8b9g3 _7pq7t62w _7pq7t6ck _7pq7t6aw _7pq7t6bm"&gt;
&lt;DIV class="go8b9g5 _7pq7t6ch"&gt;python&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;P class="qt3gz91 paragraph"&gt;Streaming events are aggregated for display and tracing; this pattern lets you attach your own metrics cleanly.&lt;/P&gt;
&lt;H3 class="_7uu25p0 qt3gz9c _7pq7t612 heading3 _7uu25p1"&gt;Server-side observability and aggregation&lt;/H3&gt;
&lt;UL class="qt3gz97 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;Enable &lt;STRONG&gt;MLflow Tracing in production&lt;/STRONG&gt; when deploying with Agent Framework; traces are logged to MLflow experiments and can be synced to Delta tables for monitoring. You’ll get latency, errors, and token usage, and can compute downstream analytics. TTFT/TBT remain client-derived unless you create custom attributes/spans for chunk timings.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;Use &lt;STRONG&gt;Agent Evaluation (MLflow 2)&lt;/STRONG&gt; to see aggregated latency metrics over an evaluation set, and add &lt;STRONG&gt;custom metrics&lt;/STRONG&gt; if you want TTFT/TBT computed from traces you collect during streaming.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3 class="_7uu25p0 qt3gz9c _7pq7t612 heading3 _7uu25p1"&gt;Guidance&lt;/H3&gt;
&lt;UL class="qt3gz97 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;Prefer the &lt;STRONG&gt;Databricks OpenAI &lt;CODE class="qt3gz9f"&gt;responses&lt;/CODE&gt; client&lt;/STRONG&gt; for new agents and streaming metrics work; keep &lt;CODE class="qt3gz9f"&gt;chat.completions&lt;/CODE&gt; only for legacy code paths.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;Use &lt;STRONG&gt;MLflow Tracing&lt;/STRONG&gt; everywhere so you can correlate your client-side TTFT/TBT with server-side spans, token usage, and cost; export traces via OpenTelemetry if you need them in your existing observability stack.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;If you need these metrics in dashboards, &lt;STRONG&gt;log them via &lt;CODE class="qt3gz9f"&gt;mlflow.log_metric&lt;/CODE&gt;&lt;/STRONG&gt; in your agent code and roll up by endpoint/version over time; combine with token usage for throughput and cost analyses.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
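&lt;P class="qt3gz91 paragraph"&gt;For the dashboard roll-up in the last point, a sketch (names are illustrative) of aggregating a window of per-request TTFT samples into percentiles with only the standard library:&lt;/P&gt;

```python
# Hypothetical roll-up: summarize per-request TTFT samples (ms) for a dashboard.
import statistics
from typing import Dict, List


def ttft_summary(samples_ms: List[float]) -> Dict[str, float]:
    """p50/p95/mean over a window of per-request TTFT measurements."""
    pct = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "ttft_p50_ms": pct[49],    # 50th percentile
        "ttft_p95_ms": pct[94],    # 95th percentile
        "ttft_mean_ms": statistics.fmean(samples_ms),
    }
```

&lt;P class="qt3gz91 paragraph"&gt;Log the summary per window (e.g., per endpoint/version per hour) via &lt;CODE class="qt3gz9f"&gt;mlflow.log_metric&lt;/CODE&gt; to get time-series views.&lt;/P&gt;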
&lt;DIV class="paragraph"&gt;Hope this helps, Louis.&lt;/DIV&gt;</description>
      <pubDate>Sat, 08 Nov 2025 21:45:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/generative-ai/measuring-latency-metrics-like-ttft-tbt-when-deploying-agents-on/m-p/138228#M1361</guid>
      <dc:creator>Louis_Frolio</dc:creator>
      <dc:date>2025-11-08T21:45:45Z</dc:date>
    </item>
    <item>
      <title>Re: Measuring latency metrics like TTFT, TBT when deploying agents on Databricks</title>
      <link>https://community.databricks.com/t5/generative-ai/measuring-latency-metrics-like-ttft-tbt-when-deploying-agents-on/m-p/138338#M1367</link>
      <description>&lt;P&gt;Thanks for your response Louis. If I understand it correctly, for production monitoring, we would have to rely on client side logging. Can mlflow.log_metric be integrated with traces by any chance? (Since that seems to be the only way to measure TTFT on agent/server side)&lt;/P&gt;</description>
      <pubDate>Mon, 10 Nov 2025 06:15:20 GMT</pubDate>
      <guid>https://community.databricks.com/t5/generative-ai/measuring-latency-metrics-like-ttft-tbt-when-deploying-agents-on/m-p/138338#M1367</guid>
      <dc:creator>Rajat-TVSM</dc:creator>
      <dc:date>2025-11-10T06:15:20Z</dc:date>
    </item>
    <item>
      <title>Re: Measuring latency metrics like TTFT, TBT when deploying agents on Databricks</title>
      <link>https://community.databricks.com/t5/generative-ai/measuring-latency-metrics-like-ttft-tbt-when-deploying-agents-on/m-p/139713#M1427</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/160119"&gt;@Rajat-TVSM&lt;/a&gt;&amp;nbsp;,&amp;nbsp;&lt;/P&gt;
&lt;P class="qt3gz91 paragraph"&gt;You’re close: for production observability you can use server‑side tracing when you deploy agents on Databricks, and client‑side instrumentation when the app runs outside; you don’t have to rely only on the client side.&lt;/P&gt;
&lt;H3 class="_7uu25p0 qt3gz9c _7pq7t612 heading3 _7uu25p1"&gt;What to use in production&lt;/H3&gt;
&lt;UL class="qt3gz97 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;&lt;STRONG&gt;Databricks‑hosted agents (Agent Framework / Model Serving):&lt;/STRONG&gt; Traces are logged automatically to your MLflow experiment, and can be synced to Delta tables for monitoring and analysis. No extra client logging is required beyond enabling tracing in the agent code or endpoint config.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;&lt;STRONG&gt;Apps deployed outside Databricks:&lt;/STRONG&gt; Use the lightweight mlflow‑tracing SDK to instrument your code and send traces to the Databricks MLflow server. This is the recommended pattern for “agent/server‑side” observability of external services.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;&lt;STRONG&gt;What tracing captures out of the box:&lt;/STRONG&gt; Inputs/outputs, intermediate steps (spans), token usage, and latency at each step, which is often enough for performance monitoring and cost tracking.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3 class="_7uu25p0 qt3gz9c _7pq7t612 heading3 _7uu25p1"&gt;Can mlflow.log_metric be “integrated with” traces?&lt;/H3&gt;
&lt;UL class="qt3gz97 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;&lt;STRONG&gt;Short answer:&lt;/STRONG&gt; mlflow.log_metric writes metrics to the active MLflow Run; &lt;STRONG&gt;traces&lt;/STRONG&gt; are a separate object with their own metadata and attributes. The recommended pattern is to log operational values (like TTFT) as a &lt;STRONG&gt;trace/span attribute&lt;/STRONG&gt; for per‑request analysis, and optionally also log the same value as an MLflow &lt;STRONG&gt;run metric&lt;/STRONG&gt; for aggregate dashboards.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;&lt;STRONG&gt;Attaching data to traces:&lt;/STRONG&gt; Use &lt;CODE class="qt3gz9f"&gt;mlflow.update_current_trace(...)&lt;/CODE&gt; for trace‑level metadata/tags, or &lt;CODE class="qt3gz9f"&gt;span.set_attribute(...)&lt;/CODE&gt; for span‑level values (for example, custom latency counters).&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3 class="_7uu25p0 qt3gz9c _7pq7t612 heading3 _7uu25p1"&gt;Measuring TTFT (time‑to‑first‑token)&lt;/H3&gt;
&lt;P class="qt3gz91 paragraph"&gt;TTFT is specific to streaming. You can compute it in your server code the moment the first token is emitted and attach it to the current span, then optionally mirror it to run metrics:&lt;/P&gt;
&lt;DIV class="go8b9g1 _7pq7t6cl" data-ui-element="code-block-container"&gt;
&lt;PRE&gt;&lt;CODE class="markdown-code-python qt3gz9e hljs language-python _1ymogdh2"&gt;&lt;SPAN class="hljs-keyword"&gt;import&lt;/SPAN&gt; time
&lt;SPAN class="hljs-keyword"&gt;import&lt;/SPAN&gt; mlflow
&lt;SPAN class="hljs-keyword"&gt;from&lt;/SPAN&gt; mlflow.entities &lt;SPAN class="hljs-keyword"&gt;import&lt;/SPAN&gt; SpanType

mlflow.set_experiment(&lt;SPAN class="hljs-string"&gt;"/Shared/prod-agent"&lt;/SPAN&gt;)
mlflow.openai.autolog()  &lt;SPAN class="hljs-comment"&gt;# or your provider autolog&lt;/SPAN&gt;

&lt;SPAN class="hljs-meta"&gt;@mlflow.trace(&lt;SPAN class="hljs-params"&gt;span_type=SpanType.CHAIN&lt;/SPAN&gt;)&lt;/SPAN&gt;
&lt;SPAN class="hljs-keyword"&gt;def&lt;/SPAN&gt; &lt;SPAN class="hljs-title function_"&gt;chat_stream&lt;/SPAN&gt;(&lt;SPAN class="hljs-params"&gt;prompt: &lt;SPAN class="hljs-built_in"&gt;str&lt;/SPAN&gt;&lt;/SPAN&gt;):
    span = mlflow.get_current_active_span()
    t0 = time.time()  &lt;SPAN class="hljs-comment"&gt;# request start&lt;/SPAN&gt;

    full_text = []
    first_token_ms = &lt;SPAN class="hljs-literal"&gt;None&lt;/SPAN&gt;

    &lt;SPAN class="hljs-keyword"&gt;for&lt;/SPAN&gt; token &lt;SPAN class="hljs-keyword"&gt;in&lt;/SPAN&gt; stream_llm_tokens(prompt):  &lt;SPAN class="hljs-comment"&gt;# your streaming call&lt;/SPAN&gt;
        full_text.append(token)
        &lt;SPAN class="hljs-keyword"&gt;if&lt;/SPAN&gt; first_token_ms &lt;SPAN class="hljs-keyword"&gt;is&lt;/SPAN&gt; &lt;SPAN class="hljs-literal"&gt;None&lt;/SPAN&gt;:
            first_token_ms = &lt;SPAN class="hljs-built_in"&gt;int&lt;/SPAN&gt;((time.time() - t0) * &lt;SPAN class="hljs-number"&gt;1000&lt;/SPAN&gt;)
            &lt;SPAN class="hljs-comment"&gt;# Attach to the trace/span for per-request analysis&lt;/SPAN&gt;
            span.set_attribute(&lt;SPAN class="hljs-string"&gt;"llm.ttft_ms"&lt;/SPAN&gt;, first_token_ms)
            &lt;SPAN class="hljs-comment"&gt;# Optionally log to the active run for aggregation&lt;/SPAN&gt;
            mlflow.log_metric(&lt;SPAN class="hljs-string"&gt;"ttft_ms"&lt;/SPAN&gt;, first_token_ms)

    &lt;SPAN class="hljs-keyword"&gt;return&lt;/SPAN&gt; &lt;SPAN class="hljs-string"&gt;""&lt;/SPAN&gt;.join(full_text)&lt;/CODE&gt;&lt;/PRE&gt;
&lt;DIV class="go8b9g3 _7pq7t62y _7pq7t6cm _7pq7t6ay _7pq7t6bo"&gt;
&lt;DIV class="_17yk06p0"&gt;&lt;SPAN&gt;This follows the same approach as MLflow’s streaming examples (capture time context and update the active trace), and pairs with MLflow’s trace timing properties for overall latency.&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;H3 class="_7uu25p0 qt3gz9c _7pq7t612 heading3 _7uu25p1"&gt;Querying TTFT later&lt;/H3&gt;
&lt;UL class="qt3gz97 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;&lt;STRONG&gt;Per‑request:&lt;/STRONG&gt; Search traces by your custom attribute, e.g., &lt;CODE class="qt3gz9f"&gt;attributes.\&lt;/CODE&gt;llm.ttft_ms` &amp;gt; 500` to find slow first‑token cases.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;&lt;STRONG&gt;Aggregate:&lt;/STRONG&gt; Use the run metric &lt;CODE class="qt3gz9f"&gt;ttft_ms&lt;/CODE&gt; for dashboards or time‑series views alongside other metrics.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3 class="_7uu25p0 qt3gz9c _7pq7t612 heading3 _7uu25p1"&gt;Related capabilities you may want&lt;/H3&gt;
&lt;UL class="qt3gz97 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;&lt;STRONG&gt;Automatic production tracing for Databricks deployments&lt;/STRONG&gt; (Agent Framework / Model Serving) with optional sync to Delta tables for monitoring pipelines.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;&lt;STRONG&gt;Token usage and latency tracking&lt;/STRONG&gt; built into traces; compute cost from tokens.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;P class="qt3gz91 paragraph"&gt;&lt;STRONG&gt;Lakehouse Monitoring for GenAI&lt;/STRONG&gt; and evaluation pipelines consume traces to compute operational and quality metrics at scale.&lt;/P&gt;
&lt;P&gt;Hope this helps, Louis.&lt;/P&gt;</description>
      <pubDate>Wed, 19 Nov 2025 18:55:26 GMT</pubDate>
      <guid>https://community.databricks.com/t5/generative-ai/measuring-latency-metrics-like-ttft-tbt-when-deploying-agents-on/m-p/139713#M1427</guid>
      <dc:creator>Louis_Frolio</dc:creator>
      <dc:date>2025-11-19T18:55:26Z</dc:date>
    </item>
  </channel>
</rss>

