
Measuring latency metrics like TTFT, TBT when deploying agents on Databricks

Rajat-TVSM
New Contributor III

Is there an inbuilt method to measure latency metrics like TTFT and TBT when deploying agents on Databricks? I'm using MLflow ChatAgent with ChatDatabricks or the OpenAI client (workspace client).

What would be the way to measure them in case no inbuilt method exists?

1 REPLY

Louis_Frolio
Databricks Employee

Greetings @Rajat-TVSM, great question. I did some research and I'm happy to help you measure streaming latency for agents on Databricks.

 

What's available out of the box

  • MLflow Tracing records inputs/outputs, spans, and operational metrics such as latency, and integrates with production monitoring; it also tracks token usage returned by LLM provider APIs. This gives you end-to-end and per-step latency, plus token counts/costs, but not named TTFT/TBT out of the box (see the sketch after this list for pulling these fields from logged traces).

  • Streaming query support via the Databricks OpenAI client on deployed agents lets you consume output chunks as they arrive with stream=True, which you can instrument to compute TTFT (time to first token) and per-chunk inter-arrival times (average time between tokens). Legacy chat.completions is still supported but not recommended for new agents; the responses API is preferred and supports streaming similarly.

  • Agent Evaluation (MLflow 2) aggregates latency and cost metrics over an evaluation set (e.g., average latency), and you can add custom metrics; for TTFT/TBT specifically, you'd implement custom code to compute and log them from traces or client timings.
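
For example, once traces land in an MLflow experiment, you can pull per-request latency and token usage into a DataFrame and compute aggregates yourself. A minimal sketch, assuming a recent MLflow version with mlflow.search_traces (the experiment path is a placeholder, and exact column names vary by MLflow version):

import mlflow

# Placeholder: point this at the experiment your agent logs traces to
mlflow.set_experiment("/Shared/my-agent-traces")

# search_traces returns a pandas DataFrame of logged traces
traces_df = mlflow.search_traces(max_results=100)

# Recent MLflow versions include per-trace execution duration and token
# usage metadata; inspect the columns since names differ across versions
print(traces_df.columns.tolist())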

Practical ways to capture TTFT and TBT

If you need TTFT/TBT specifically, instrument the streaming loop on the client and optionally log metrics to MLflow. Below are idiomatic examples for both the recommended responses client and the legacy chat.completions.

Databricks OpenAI client (responses): compute TTFT, avg TBT, throughput

import time
import numpy as np
from openai import OpenAI

# Point the OpenAI client at your workspace's OpenAI-compatible endpoints
client = OpenAI(
    base_url="https://<your-workspace-host>/serving-endpoints",
    api_key="<your-databricks-token>",
)
endpoint = "<your-agent-endpoint>"
input_msgs = [{"role": "user", "content": "Explain MLflow Tracing"}]

t0 = time.perf_counter()
ttft = None
arrival_times = []

stream = client.responses.create(model=endpoint, input=input_msgs, stream=True)
for event in stream:
    now = time.perf_counter()
    # Count only text delta events as token arrivals; responses streams also
    # emit lifecycle events (e.g., response.created) that would skew timings
    if getattr(event, "type", "") != "response.output_text.delta":
        continue
    # First delta -> TTFT
    if ttft is None:
        ttft = (now - t0) * 1000  # ms
    arrival_times.append(now)

# Compute metrics
avg_tbt_ms = (np.diff(arrival_times).mean() * 1000) if len(arrival_times) > 1 else None
stream_time_s = (arrival_times[-1] - arrival_times[0]) if arrival_times else None

print({"ttft_ms": ttft, "avg_tbt_ms": avg_tbt_ms, "stream_time_s": stream_time_s})

Then combine with token usage from traces (if enabled) to compute tokens/sec:

# Example, assuming you have already fetched the MLflow trace for this request
# (token_usage may be None if the provider didn't return token counts):
output_tokens = trace.info.token_usage["output_tokens"]
tps = output_tokens / (arrival_times[-1] - arrival_times[0])  # tokens/sec

This leverages streaming with responses and MLflow Tracing's token usage fields.
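
One way to get that trace object on the client is to enable autologging before the request so MLflow records the streaming call locally. A sketch, assuming a recent MLflow version where mlflow.openai.autolog() covers the OpenAI client and mlflow.get_last_active_trace_id() is available:

import mlflow

mlflow.openai.autolog()  # enable before making the streaming request

# ... run the streaming loop from the example above ...

# Retrieve the trace MLflow just recorded for this request
trace = mlflow.get_trace(mlflow.get_last_active_trace_id())
print(trace.info.token_usage)  # may be None if token counts weren't captured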

Databricks OpenAI client (legacy chat.completions): compute TTFT and avg TBT

import time
import numpy as np
from openai import OpenAI

# Same OpenAI-compatible client configuration as in the responses example
client = OpenAI(
    base_url="https://<your-workspace-host>/serving-endpoints",
    api_key="<your-databricks-token>",
)
endpoint = "<your-agent-endpoint>"
messages = [{"role": "user", "content": "Explain MLflow Tracing"}]

t0 = time.perf_counter()
ttft = None
arrival_times = []

stream = client.chat.completions.create(model=endpoint, messages=messages, stream=True)
for chunk in stream:
    now = time.perf_counter()
    # Skip chunks without text content (e.g., the initial role-only delta
    # and the final finish_reason chunk)
    if not chunk.choices or not chunk.choices[0].delta.content:
        continue
    if ttft is None:
        ttft = (now - t0) * 1000  # ms
    arrival_times.append(now)

avg_tbt_ms = (np.diff(arrival_times).mean() * 1000) if len(arrival_times) > 1 else None
print({"ttft_ms": ttft, "avg_tbt_ms": avg_tbt_ms})

Use this for existing chat.completions integrations; new agents should prefer responses.

Logging TTFT/TBT within a ResponsesAgent implementation

If you author agents in code, you can instrument your predict_stream to compute and log TTFT/TBT and let MLflow Tracing aggregate the full output for you.

import time
from typing import Generator

import mlflow
from mlflow.pyfunc import ResponsesAgent
from mlflow.types.responses import ResponsesAgentRequest, ResponsesAgentStreamEvent

class MyAgent(ResponsesAgent):
    def predict_stream(
        self, request: ResponsesAgentRequest
    ) -> Generator[ResponsesAgentStreamEvent, None, None]:
        t0 = time.perf_counter()
        first_time = None
        arrival_times = []

        # Stream chunks from your underlying LLM/toolchain (self.agent stands
        # in for whatever chain or client your agent wraps)
        for stream_event in self.agent.stream(request.input):
            now = time.perf_counter()
            if first_time is None:
                first_time = now
                # Note: log_metric needs an active MLflow run in this context
                mlflow.log_metric("ttft_ms", (first_time - t0) * 1000)

            arrival_times.append(now)
            yield stream_event  # forward the delta events

        if len(arrival_times) > 1:
            gaps = [arrival_times[i] - arrival_times[i - 1] for i in range(1, len(arrival_times))]
            mlflow.log_metric("avg_tbt_ms", sum(gaps) / len(gaps) * 1000)

Streaming events are aggregated for display and tracing; this pattern lets you attach your own metrics cleanly.

Server-side observability and aggregation

  • Enable MLflow Tracing in production when deploying with Agent Framework; traces are logged to MLflow experiments and can be synced to Delta tables for monitoring. You'll get latency, errors, and token usage, and can compute downstream analytics. TTFT/TBT remain client-derived unless you create custom attributes/spans for chunk timings (see the sketch after this list).

  • Use Agent Evaluation (MLflow 2) to see aggregated latency metrics over an evaluation set, and add custom metrics if you want TTFT/TBT computed from traces you collect during streaming.
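
For the custom-attribute route, a minimal sketch using MLflow's fluent tracing APIs (mlflow.trace, mlflow.get_current_active_span); the attribute names here are illustrative, not a Databricks convention:

import time
import mlflow

@mlflow.trace
def consume_stream(stream):
    t0 = time.perf_counter()
    arrival_times = []
    for event in stream:
        arrival_times.append(time.perf_counter())

    # Attach chunk timings to the active span so they land in the trace
    span = mlflow.get_current_active_span()
    if span and arrival_times:
        span.set_attribute("ttft_ms", (arrival_times[0] - t0) * 1000)
        if len(arrival_times) > 1:
            gaps = [b - a for a, b in zip(arrival_times, arrival_times[1:])]
            span.set_attribute("avg_tbt_ms", sum(gaps) / len(gaps) * 1000)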

Guidance

  • Prefer the Databricks OpenAI responses client for new agents and streaming metrics work; keep chat.completions only for legacy code paths.

  • Use MLflow Tracing everywhere so you can correlate your client-side TTFT/TBT with server-side spans, token usage, and cost; export traces via OpenTelemetry if you need them in your existing observability stack (a minimal export config is sketched after this list).

  • If you need these metrics in dashboards, log them via mlflow.log_metric in your agent code and roll up by endpoint/version over time; combine with token usage for throughput and cost analyses.
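
For the OpenTelemetry route, MLflow routes trace spans to an OTLP collector when the standard OTel endpoint variable is set before tracing starts; a sketch, with the collector address as a placeholder and assuming the OTLP exporter packages are installed:

import os

# When set, MLflow exports trace spans to this OTLP collector instead of
# the MLflow tracking server
os.environ["OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"] = "http://<collector-host>:4317"

import mlflow

@mlflow.trace
def traced_call():
    ...  # spans created here are exported to the collector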

Hope this helps, Louis.