Greetings @Rajat-TVSM, great question. I did some research and I'm happy to help you measure streaming latency for agents on Databricks.
What's available out of the box
- MLflow Tracing records inputs/outputs, spans, and operational metrics such as latency, and integrates with production monitoring; it also tracks token usage returned by LLM provider APIs. This gives you end-to-end and per-step latency, plus token counts/costs, but not named TTFT/TBT out of the box (see the sketch just after this list for turning this capture on).
- Streaming query support via the Databricks OpenAI client on deployed agents lets you consume output chunks as they arrive with stream=True, which you can instrument to compute TTFT (time to first token) and per-chunk inter-arrival times (average time between tokens). Legacy chat.completions is still supported but not recommended for new agents; the responses API is preferred and supports streaming similarly.
- Agent Evaluation (MLflow 2) aggregates latency and cost metrics over an evaluation set (e.g., average latency), and you can add custom metrics; for TTFT/TBT specifically, you'd implement custom code to compute and log them from traces or client timings.
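To get that baseline latency and token-usage capture with minimal code, a sketch along these lines works (the experiment path is a placeholder, and it assumes your MLflow version's OpenAI autologging covers the Responses API; otherwise the same pattern applies to chat.completions):

```python
import mlflow
from openai import OpenAI

mlflow.set_experiment("/Shared/agent-latency")  # placeholder experiment path
mlflow.openai.autolog()  # capture each OpenAI-client call as a trace

client = OpenAI()
response = client.responses.create(
    model="<your-agent-endpoint>",
    input=[{"role": "user", "content": "Explain MLflow Tracing"}],
)
# Each call now produces a trace with end-to-end latency, per-span timings,
# and token usage (when the provider reports it), visible in the MLflow UI.
```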
Practical ways to capture TTFT and TBT
If you need TTFT/TBT specifically, instrument the streaming loop on the client and optionally log metrics to MLflow. Below are idiomatic examples for both the recommended responses client and the legacy chat.completions.
Databricks OpenAI client (responses): compute TTFT, avg TBT, throughput
```python
import time

import numpy as np
from openai import OpenAI

client = OpenAI()
endpoint = "<your-agent-endpoint>"
input_msgs = [{"role": "user", "content": "Explain MLflow Tracing"}]

t0 = time.perf_counter()
ttft = None
arrival_times = []

stream = client.responses.create(model=endpoint, input=input_msgs, stream=True)
for event in stream:
    # Note: every stream event is timed here; filter to text-delta events if you prefer
    now = time.perf_counter()
    if ttft is None:
        # First streamed event: time to first token, in milliseconds
        ttft = (now - t0) * 1000
    arrival_times.append(now)

# Average inter-arrival time between streamed events (a proxy for TBT)
avg_tbt_ms = (np.diff(arrival_times).mean() * 1000) if len(arrival_times) > 1 else None
stream_time_s = (arrival_times[-1] - arrival_times[0]) if arrival_times else None
print({"ttft_ms": ttft, "avg_tbt_ms": avg_tbt_ms, "stream_time_s": stream_time_s})
```
Then combine with token usage from traces (if enabled) to compute tokens/sec:
```python
# "trace" is the MLflow trace for this request (requires tracing enabled)
tps = trace.info.token_usage.output_tokens / (arrival_times[-1] - arrival_times[0])
```
This leverages streaming with responses and MLflow Tracing's token usage fields.
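A single request gives noisy numbers, so in practice you may want to repeat the measurement and report percentiles. A minimal sketch building on the loop above (run_once and the prompt are placeholders):

```python
import time

import numpy as np
from openai import OpenAI

client = OpenAI()
endpoint = "<your-agent-endpoint>"


def run_once(prompt: str):
    """Stream one request and return (ttft_ms, avg_tbt_ms)."""
    t0 = time.perf_counter()
    ttft = None
    arrival_times = []
    stream = client.responses.create(
        model=endpoint,
        input=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for _event in stream:
        now = time.perf_counter()
        if ttft is None:
            ttft = (now - t0) * 1000
        arrival_times.append(now)
    avg_tbt = np.diff(arrival_times).mean() * 1000 if len(arrival_times) > 1 else None
    return ttft, avg_tbt


# Repeat the measurement and summarize with percentiles
results = [run_once("Explain MLflow Tracing") for _ in range(10)]
ttfts = [r[0] for r in results if r[0] is not None]
tbts = [r[1] for r in results if r[1] is not None]
print({
    "ttft_p50_ms": float(np.percentile(ttfts, 50)),
    "ttft_p95_ms": float(np.percentile(ttfts, 95)),
    "avg_tbt_p50_ms": float(np.percentile(tbts, 50)),
})
```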
Databricks OpenAI client (legacy chat.completions): compute TTFT and avg TBT
```python
import time

import numpy as np
from openai import OpenAI

client = OpenAI()
endpoint = "<your-agent-endpoint>"
messages = [{"role": "user", "content": "Explain MLflow Tracing"}]

t0 = time.perf_counter()
ttft = None
arrival_times = []

stream = client.chat.completions.create(model=endpoint, messages=messages, stream=True)
for chunk in stream:
    now = time.perf_counter()
    if ttft is None:
        # First streamed chunk: time to first token, in milliseconds
        ttft = (now - t0) * 1000
    arrival_times.append(now)

avg_tbt_ms = (np.diff(arrival_times).mean() * 1000) if len(arrival_times) > 1 else None
print({"ttft_ms": ttft, "avg_tbt_ms": avg_tbt_ms})
```
Use this for existing chat.completions integrations; new agents should prefer responses.
Logging TTFT/TBT within a ResponsesAgent implementation
If you author agents in code, you can instrument your predict_stream to compute and log TTFT/TBT and let MLflow Tracing aggregate the full output for you.
```python
import time
from typing import Generator

import mlflow
# In current MLflow the authoring interface lives in mlflow, not databricks.agents
from mlflow.pyfunc import ResponsesAgent
from mlflow.types.responses import ResponsesAgentRequest, ResponsesAgentStreamEvent


class MyAgent(ResponsesAgent):
    # predict() and the wiring of self.agent (the wrapped inner agent) are omitted for brevity
    def predict_stream(
        self, request: ResponsesAgentRequest
    ) -> Generator[ResponsesAgentStreamEvent, None, None]:
        t0 = time.perf_counter()
        first_time = None
        arrival_times = []
        for stream_event in self.agent.stream(request.input):
            now = time.perf_counter()
            if first_time is None:
                first_time = now
                # Log time to first streamed event once
                mlflow.log_metric("ttft_ms", (first_time - t0) * 1000)
            arrival_times.append(now)
            yield stream_event
        if len(arrival_times) > 1:
            # Average gap between consecutive streamed events, in milliseconds
            avg_tbt_ms = (
                sum(arrival_times[i] - arrival_times[i - 1] for i in range(1, len(arrival_times)))
                / (len(arrival_times) - 1)
            ) * 1000
            mlflow.log_metric("avg_tbt_ms", avg_tbt_ms)
```
Streaming events are aggregated for display and tracing; this pattern lets you attach your own metrics cleanly.
Server-side observability and aggregation
- Enable MLflow Tracing in production when deploying with Agent Framework; traces are logged to MLflow experiments and can be synced to Delta tables for monitoring. You'll get latency, errors, and token usage, and can compute downstream analytics. TTFT/TBT remain client-derived unless you create custom attributes/spans for chunk timings (see the sketch after this list).
- Use Agent Evaluation (MLflow 2) to see aggregated latency metrics over an evaluation set, and add custom metrics if you want TTFT/TBT computed from traces you collect during streaming.
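One way to do the "custom attributes/spans for chunk timings" part is to wrap the client streaming loop in a traced function and attach your timings as span attributes. A sketch, assuming MLflow's fluent tracing APIs (mlflow.trace, mlflow.get_current_active_span) are available in your MLflow version:

```python
import time

import mlflow
from openai import OpenAI

client = OpenAI()
endpoint = "<your-agent-endpoint>"


@mlflow.trace(name="stream_request")  # records the streamed call as a trace span
def stream_with_timings(prompt: str) -> dict:
    t0 = time.perf_counter()
    ttft = None
    arrival_times = []
    stream = client.responses.create(
        model=endpoint,
        input=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for _event in stream:
        now = time.perf_counter()
        if ttft is None:
            ttft = (now - t0) * 1000
        arrival_times.append(now)

    avg_tbt_ms = (
        (arrival_times[-1] - arrival_times[0]) / (len(arrival_times) - 1) * 1000
        if len(arrival_times) > 1
        else None
    )
    # Attach the client-derived timings to the active span so they live
    # alongside server-side latency and token usage in the same trace.
    span = mlflow.get_current_active_span()
    if span is not None:
        span.set_attributes({"ttft_ms": ttft, "avg_tbt_ms": avg_tbt_ms})
    return {"ttft_ms": ttft, "avg_tbt_ms": avg_tbt_ms}
```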
Guidance
- Prefer the Databricks OpenAI responses client for new agents and streaming-metrics work; keep chat.completions only for legacy code paths.
- Use MLflow Tracing everywhere so you can correlate your client-side TTFT/TBT with server-side spans, token usage, and cost; export traces via OpenTelemetry if you need them in your existing observability stack.
- If you need these metrics in dashboards, log them via mlflow.log_metric in your agent code and roll up by endpoint/version over time; combine with token usage for throughput and cost analyses (a small sketch follows this list).
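For that last point, a minimal sketch of logging and rolling up (experiment path, tag names, and the example metric values are placeholders):

```python
import mlflow

# Values measured by one of the client-side loops above (placeholders)
ttft_ms, avg_tbt_ms = 412.0, 38.5

mlflow.set_experiment("/Shared/agent-latency")  # placeholder experiment path
with mlflow.start_run(run_name="latency-probe"):
    # Tag by endpoint/version so you can roll up per deployment later
    mlflow.set_tags({"endpoint": "<your-agent-endpoint>", "agent_version": "3"})
    mlflow.log_metrics({"ttft_ms": ttft_ms, "avg_tbt_ms": avg_tbt_ms})

# Roll up over time, e.g. for a dashboard or a scheduled report
runs = mlflow.search_runs(filter_string="tags.endpoint = '<your-agent-endpoint>'")
print(runs[["metrics.ttft_ms", "metrics.avg_tbt_ms"]].describe())
```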
Hope this helps, Louis.