2 weeks ago
Is there an inbuilt method to measure latency metrics like TTFT, TBT when deploying agents on Databricks? Using MLflow ChatAgent, ChatDatabricks/OpenAI client (workspace client)
What would be the way to measure them in case no inbuilt method exists?
2 weeks ago
Greetings @Rajat-TVSM, great question. I did some research and I am happy to help you measure streaming latency for agents on Databricks.
MLflow Tracing records inputs/outputs, spans, and operational metrics such as latency, and integrates with production monitoring; it also tracks token usage returned by LLM provider APIs. This gives you end-to-end and per-step latency, plus token counts/costs, but not named TTFT/TBT out of the box.
Streaming query support via the Databricks OpenAI client on deployed agents lets you consume output chunks as they arrive with stream=True, which you can instrument to compute TTFT (time to first token) and per-chunk inter-arrival times (average time between tokens). Legacy chat.completions is still supported but not recommended for new agents; the responses API is preferred for new agents and supports streaming similarly.
Agent Evaluation (MLflow 2) aggregates latency and cost metrics over an evaluation set (e.g., average latency), and you can add custom metrics; for TTFT/TBT specifically, you'd implement custom code to compute and log them from traces or client timings.
If you need TTFT/TBT specifically, instrument the streaming loop on the client and optionally log metrics to MLflow. Below are idiomatic examples for both the recommended responses client and the legacy chat.completions.
import time
import numpy as np
from openai import OpenAI

client = OpenAI()  # configured for Databricks OpenAI-compatible endpoints
endpoint = "<your-agent-endpoint>"
input_msgs = [{"role": "user", "content": "Explain MLflow Tracing"}]

t0 = time.perf_counter()
ttft = None
arrival_times = []

stream = client.responses.create(model=endpoint, input=input_msgs, stream=True)
for event in stream:
    now = time.perf_counter()
    # First streamed event -> TTFT
    if ttft is None:
        ttft = (now - t0) * 1000  # ms
    arrival_times.append(now)

# Compute metrics once the stream is exhausted
avg_tbt_ms = (np.diff(arrival_times).mean() * 1000) if len(arrival_times) > 1 else None
stream_time_s = (arrival_times[-1] - arrival_times[0]) if arrival_times else None
print({"ttft_ms": ttft, "avg_tbt_ms": avg_tbt_ms, "stream_time_s": stream_time_s})
Then combine with token usage from traces (if enabled) to compute tokens/sec:
# Example if you have access to the MLflow trace object for this request
# (token-usage field names can vary by MLflow version):
# tokens/sec = output_tokens / (arrival_times[-1] - arrival_times[0])
tps = trace.info.token_usage.output_tokens / (arrival_times[-1] - arrival_times[0])
This leverages streaming with responses and MLflow Tracing's token usage fields.
import time
import numpy as np
from openai import OpenAI

client = OpenAI()
endpoint = "<your-agent-endpoint>"
messages = [{"role": "user", "content": "Explain MLflow Tracing"}]

t0 = time.perf_counter()
ttft = None
arrival_times = []

stream = client.chat.completions.create(model=endpoint, messages=messages, stream=True)
for chunk in stream:
    now = time.perf_counter()
    if ttft is None:
        ttft = (now - t0) * 1000  # ms
    arrival_times.append(now)

avg_tbt_ms = (np.diff(arrival_times).mean() * 1000) if len(arrival_times) > 1 else None
print({"ttft_ms": ttft, "avg_tbt_ms": avg_tbt_ms})
Use this for existing chat.completions integrations; new agents should prefer responses.
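If you want these client-side numbers to land in MLflow rather than just stdout, here is a minimal sketch of a logging helper; the experiment path and run naming below are placeholders, not a Databricks convention:

import mlflow
import numpy as np

def log_stream_latency(endpoint_name, ttft_ms, arrival_times):
    # Record client-measured streaming latency in an MLflow run for later roll-up.
    mlflow.set_experiment("/Shared/agent-latency")  # placeholder experiment path
    with mlflow.start_run(run_name=f"latency-{endpoint_name}"):
        mlflow.log_param("endpoint", endpoint_name)
        if ttft_ms is not None:
            mlflow.log_metric("ttft_ms", ttft_ms)
        if len(arrival_times) > 1:
            mlflow.log_metric("avg_tbt_ms", float(np.diff(arrival_times).mean() * 1000))
            mlflow.log_metric("stream_time_s", float(arrival_times[-1] - arrival_times[0]))

# e.g. after either streaming loop above:
# log_stream_latency(endpoint, ttft, arrival_times)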
If you author agents in code, you can instrument your predict_stream to compute and log TTFT/TBT and let MLflow Tracing aggregate the full output for you.
import time
import mlflow
from mlflow.pyfunc import ResponsesAgent  # ResponsesAgent is provided by MLflow

class MyAgent(ResponsesAgent):
    def predict_stream(self, request):
        t0 = time.perf_counter()
        first_time = None
        arrival_times = []
        for stream_event in self.agent.stream(request.input):  # stream chunks from your underlying LLM/toolchain
            now = time.perf_counter()
            if first_time is None:
                first_time = now
                mlflow.log_metric("ttft_ms", (first_time - t0) * 1000)
            arrival_times.append(now)
            yield stream_event  # forward the delta events
        if len(arrival_times) > 1:
            avg_tbt_ms = (sum(arrival_times[i] - arrival_times[i - 1] for i in range(1, len(arrival_times))) / (len(arrival_times) - 1)) * 1000
            mlflow.log_metric("avg_tbt_ms", avg_tbt_ms)
Streaming events are aggregated for display and tracing; this pattern lets you attach your own metrics cleanly.
Enable MLflow Tracing in production when deploying with Agent Framework; traces are logged to MLflow experiments and can be synced to Delta tables for monitoring. You'll get latency, errors, and token usage, and can compute downstream analytics. TTFT/TBT remain client-derived unless you create custom attributes/spans for chunk timings.
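For the downstream analytics piece, you can also pull traces back programmatically; a minimal sketch with mlflow.search_traces (the experiment ID is a placeholder, and the returned DataFrame columns vary by MLflow version, so inspect them before relying on any one field):

import mlflow

# Fetch recent production traces for the agent's experiment (placeholder ID).
traces = mlflow.search_traces(experiment_ids=["<experiment-id>"], max_results=1000)
print(traces.columns)  # inspect the available fields first

# If an end-to-end duration column is present, summarize it.
if "execution_time_ms" in traces.columns:
    print(traces["execution_time_ms"].describe(percentiles=[0.5, 0.95, 0.99]))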
Use Agent Evaluation (MLflow 2) to see aggregated latency metrics over an evaluation set, and add custom metrics if you want TTFT/TBT computed from traces you collect during streaming.
Prefer the Databricks OpenAI responses client for new agents and streaming metrics work; keep chat.completions only for legacy code paths.
Use MLflow Tracing everywhere so you can correlate your client-side TTFT/TBT with server-side spans, token usage, and cost; export traces via OpenTelemetry if you need them in your existing observability stack.
If you need these metrics in dashboards, log them via mlflow.log_metric in your agent code and roll up by endpoint/version over time; combine with token usage for throughput and cost analyses.
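As a sketch of that roll-up, assuming you logged ttft_ms/avg_tbt_ms and an endpoint param as in the helper above (the experiment path is the same placeholder):

import mlflow

# Aggregate the client/agent-logged metrics across runs for dashboarding.
runs = mlflow.search_runs(experiment_names=["/Shared/agent-latency"])
summary = (
    runs.groupby("params.endpoint")[["metrics.ttft_ms", "metrics.avg_tbt_ms"]]
    .describe(percentiles=[0.5, 0.95])
)
print(summary)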
2 weeks ago
Thanks for your response, Louis. If I understand it correctly, for production monitoring, we would have to rely on client-side logging. Can mlflow.log_metric be integrated with traces by any chance? (Since that seems to be the only way to measure TTFT on agent/server side)
Wednesday
You're close: for production observability you can use server-side tracing when you deploy agents on Databricks, and client-side instrumentation when the app runs outside; you don't have to rely only on the client side.
Databricks-hosted agents (Agent Framework / Model Serving): Traces are logged automatically to your MLflow experiment, and can be synced to Delta tables for monitoring and analysis. No extra client logging is required beyond enabling tracing in the agent code or endpoint config.
Apps deployed outside Databricks: Use the lightweight mlflow-tracing SDK to instrument your code and send traces to the Databricks MLflow server. This is the recommended pattern for "agent/server-side" observability of external services; a minimal sketch follows after this list.
What tracing captures out of the box: Inputs/outputs, intermediate steps (spans), token usage, and latency at each step, which is often enough for performance monitoring and cost tracking.
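For the "apps deployed outside Databricks" pattern above, here is a minimal sketch; the workspace host, token, and experiment path are placeholders, and call_my_agent is a hypothetical stand-in for your own agent call:

import os
import mlflow

# Point the lightweight mlflow-tracing SDK at the Databricks MLflow server (placeholders).
os.environ["MLFLOW_TRACKING_URI"] = "databricks"
os.environ["DATABRICKS_HOST"] = "https://<your-workspace-host>"
os.environ["DATABRICKS_TOKEN"] = "<token>"
mlflow.set_experiment("/Shared/prod-agent")

@mlflow.trace
def answer(question: str) -> str:
    # Your external app's agent logic; the resulting trace is sent to Databricks.
    return call_my_agent(question)  # hypothetical helper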
Short answer: mlflow.log_metric writes metrics to the active MLflow Run; traces are a separate object with their own metadata and attributes. The recommended pattern is to log operational values (like TTFT) as a trace/span attribute for per-request analysis, and optionally also log the same value as an MLflow run metric for aggregate dashboards.
Attaching data to traces: Use mlflow.update_current_trace(...) for trace-level metadata/tags, or span.set_attribute(...) for span-level values (for example, custom latency counters).
TTFT is specific to streaming. You can compute it in your server code the moment the first token is emitted and attach it to the current span, then optionally mirror it to run metrics:
import time
import mlflow
from mlflow.entities import SpanType

mlflow.set_experiment("/Shared/prod-agent")
mlflow.openai.autolog()  # or your provider autolog

@mlflow.trace(span_type=SpanType.CHAIN)
def chat_stream(prompt: str):
    span = mlflow.get_current_active_span()
    t0 = time.time()  # request start
    full_text = []
    first_token_ms = None
    for token in stream_llm_tokens(prompt):  # your streaming call
        full_text.append(token)
        if first_token_ms is None:
            first_token_ms = int((time.time() - t0) * 1000)
            # Attach to the trace/span for per-request analysis
            span.set_attribute("llm.ttft_ms", first_token_ms)
            # Optionally log to the active run for aggregation
            mlflow.log_metric("ttft_ms", first_token_ms)
    return "".join(full_text)
Per-request: Search traces by your custom attribute, e.g., attributes.llm.ttft_ms > 500, to find slow first-token cases (a sketch of this follows below).
Aggregate: Use the run metric ttft_ms for dashboards or time-series views alongside other metrics.
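A minimal sketch of the per-request side, assuming the llm.ttft_ms attribute set above; the experiment ID is a placeholder, and trace/span field names can vary slightly across MLflow versions, so verify them on your install:

from mlflow import MlflowClient

mlflow_client = MlflowClient()
# Pull recent traces and scan their spans for the custom TTFT attribute.
traces = mlflow_client.search_traces(experiment_ids=["<experiment-id>"], max_results=500)

slow_ttfts = []
for t in traces:
    for span in t.data.spans:
        ttft = (span.attributes or {}).get("llm.ttft_ms")
        if ttft is not None and ttft > 500:
            slow_ttfts.append(ttft)

print(f"{len(slow_ttfts)} requests exceeded 500 ms TTFT")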
Automatic production tracing for Databricks deployments (Agent Framework / Model Serving) with optional sync to Delta tables for monitoring pipelines.
Token usage and latency tracking built into traces; compute cost from tokens.
Lakehouse Monitoring for GenAI and evaluation pipelines consume traces to compute operational and quality metrics at scale.
Hope this helps, Louis.