<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Measuring latency metrics like TTFT, TBT when deploying agents on Databricks in Generative AI</title>
    <link>https://community.databricks.com/t5/generative-ai/measuring-latency-metrics-like-ttft-tbt-when-deploying-agents-on/m-p/138338#M1367</link>
    <description>&lt;P&gt;Thanks for your response Louis. If I understand it correctly, for production monitoring, we would have to rely on client side logging. Can mlflow.log_metric be integrated with traces by any chance? (Since that seems to be the only way to measure TTFT on agent/server side)&lt;/P&gt;</description>
    <pubDate>Mon, 10 Nov 2025 06:15:20 GMT</pubDate>
    <dc:creator>Rajat-TVSM</dc:creator>
    <dc:date>2025-11-10T06:15:20Z</dc:date>
    <item>
      <title>Measuring latency metrics like TTFT, TBT when deploying agents on Databricks</title>
      <link>https://community.databricks.com/t5/generative-ai/measuring-latency-metrics-like-ttft-tbt-when-deploying-agents-on/m-p/138208#M1357</link>
      <description>&lt;P&gt;Is there an inbuilt method to measure latency metrics like TTFT, TBT when deploying agents on Databricks? Using MLFlow ChatAgent, ChatDatabricks/OpenAI client(workspace client)&lt;/P&gt;&lt;P&gt;What would be the way to measure them in case no inbuilt method exists?&lt;/P&gt;</description>
      <pubDate>Sat, 08 Nov 2025 09:06:38 GMT</pubDate>
      <guid>https://community.databricks.com/t5/generative-ai/measuring-latency-metrics-like-ttft-tbt-when-deploying-agents-on/m-p/138208#M1357</guid>
      <dc:creator>Rajat-TVSM</dc:creator>
      <dc:date>2025-11-08T09:06:38Z</dc:date>
    </item>
    <item>
      <title>Re: Measuring latency metrics like TTFT, TBT when deploying agents on Databricks</title>
      <link>https://community.databricks.com/t5/generative-ai/measuring-latency-metrics-like-ttft-tbt-when-deploying-agents-on/m-p/138228#M1361</link>
      <description>&lt;P&gt;Greetings&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/160119"&gt;@Rajat-TVSM&lt;/a&gt;&amp;nbsp;, great question. I did some research and I am h&lt;SPAN&gt;appy to help you measure streaming latency for agents on Databricks.&lt;/SPAN&gt;&lt;/P&gt;
&lt;DIV class="paragraph"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;H3 class="_7uu25p0 qt3gz9c _7pq7t612 heading3 _7uu25p1"&gt;What’s available out of the box&lt;/H3&gt;
&lt;UL class="qt3gz97 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;&lt;STRONG&gt;MLflow Tracing&lt;/STRONG&gt; records inputs/outputs, spans, and operational metrics such as latency, and integrates with production monitoring; it also tracks token usage returned by LLM provider APIs. This gives you end‑to‑end and per‑step latency, plus token counts/costs, but not named TTFT/TBT out of the box.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;&lt;STRONG&gt;Streaming query support&lt;/STRONG&gt; via the Databricks OpenAI client on deployed agents lets you consume output chunks as they arrive with &lt;CODE class="qt3gz9f"&gt;stream=True&lt;/CODE&gt;, which you can instrument to compute TTFT (time to first token) and per‑chunk inter‑arrival times (average time between tokens). Legacy &lt;CODE class="qt3gz9f"&gt;chat.completions&lt;/CODE&gt; is still supported but not recommended for new agents; the &lt;CODE class="qt3gz9f"&gt;responses&lt;/CODE&gt; API is preferred for new agents and supports streaming similarly.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;&lt;STRONG&gt;Agent Evaluation (MLflow 2)&lt;/STRONG&gt; aggregates latency and cost metrics over an evaluation set (e.g., average latency), and you can add custom metrics; for TTFT/TBT specifically, you’d implement custom code to compute and log them from traces or client timings.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
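&lt;P class="qt3gz91 paragraph"&gt;Because TTFT/TBT are not named metrics out of the box, it helps to pin down the arithmetic once. A minimal sketch (helper and field names are illustrative, not a Databricks API) that turns a request-start timestamp plus per-chunk arrival timestamps into the three numbers discussed in this thread:&lt;/P&gt;

```python
# Hypothetical helper: derive TTFT / avg TBT / stream time from chunk timestamps.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StreamLatency:
    ttft_ms: float                # request start to first chunk
    avg_tbt_ms: Optional[float]   # mean gap between consecutive chunks (None if one chunk)
    stream_time_s: float          # first chunk to last chunk


def summarize_stream(t0: float, arrivals: List[float]) -> StreamLatency:
    """t0 is the request-start timestamp; arrivals are per-chunk timestamps (seconds)."""
    if not arrivals:
        raise ValueError("no chunks received")
    gaps = [b - a for a, b in zip(arrivals, arrivals[1:])]
    avg_tbt = (sum(gaps) / len(gaps)) * 1000 if gaps else None
    return StreamLatency(
        ttft_ms=(arrivals[0] - t0) * 1000,
        avg_tbt_ms=avg_tbt,
        stream_time_s=arrivals[-1] - arrivals[0],
    )
```

&lt;P class="qt3gz91 paragraph"&gt;The same helper works for any streaming client loop that records &lt;CODE class="qt3gz9f"&gt;time.perf_counter()&lt;/CODE&gt; per chunk.&lt;/P&gt;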
&lt;H3 class="_7uu25p0 qt3gz9c _7pq7t612 heading3 _7uu25p1"&gt;Practical ways to capture TTFT and TBT&lt;/H3&gt;
&lt;P class="qt3gz91 paragraph"&gt;If you need TTFT/TBT specifically, instrument the streaming loop on the client and optionally log metrics to MLflow. Below are idiomatic examples for both the recommended &lt;CODE class="qt3gz9f"&gt;responses&lt;/CODE&gt; client and the legacy &lt;CODE class="qt3gz9f"&gt;chat.completions&lt;/CODE&gt;.&lt;/P&gt;
&lt;H4 class="_7uu25p0 qt3gz9c _7pq7t612 heading4 _7uu25p1"&gt;Databricks OpenAI client (responses) — compute TTFT, avg TBT, throughput&lt;/H4&gt;
&lt;DIV class="go8b9g1 _7pq7t6cj" data-ui-element="code-block-container"&gt;
&lt;PRE&gt;&lt;CODE class="markdown-code-python qt3gz9e hljs language-python _1ymogdh2"&gt;&lt;SPAN class="hljs-keyword"&gt;import&lt;/SPAN&gt; time
&lt;SPAN class="hljs-keyword"&gt;import&lt;/SPAN&gt; numpy &lt;SPAN class="hljs-keyword"&gt;as&lt;/SPAN&gt; np
&lt;SPAN class="hljs-keyword"&gt;from&lt;/SPAN&gt; openai &lt;SPAN class="hljs-keyword"&gt;import&lt;/SPAN&gt; OpenAI

client = OpenAI()  &lt;SPAN class="hljs-comment"&gt;# set api_key and base_url="https://&amp;lt;workspace-host&amp;gt;/serving-endpoints" for Databricks&lt;/SPAN&gt;
endpoint = &lt;SPAN class="hljs-string"&gt;"&amp;lt;your-agent-endpoint&amp;gt;"&lt;/SPAN&gt;
input_msgs = [{&lt;SPAN class="hljs-string"&gt;"role"&lt;/SPAN&gt;: &lt;SPAN class="hljs-string"&gt;"user"&lt;/SPAN&gt;, &lt;SPAN class="hljs-string"&gt;"content"&lt;/SPAN&gt;: &lt;SPAN class="hljs-string"&gt;"Explain MLflow Tracing"&lt;/SPAN&gt;}]

t0 = time.perf_counter()
ttft = &lt;SPAN class="hljs-literal"&gt;None&lt;/SPAN&gt;
arrival_times = []
last_text_time = &lt;SPAN class="hljs-literal"&gt;None&lt;/SPAN&gt;

stream = client.responses.create(model=endpoint, &lt;SPAN class="hljs-built_in"&gt;input&lt;/SPAN&gt;=input_msgs, stream=&lt;SPAN class="hljs-literal"&gt;True&lt;/SPAN&gt;)
&lt;SPAN class="hljs-keyword"&gt;for&lt;/SPAN&gt; event &lt;SPAN class="hljs-keyword"&gt;in&lt;/SPAN&gt; stream:
    now = time.perf_counter()
    &lt;SPAN class="hljs-comment"&gt;# First delta -&amp;gt; TTFT&lt;/SPAN&gt;
    &lt;SPAN class="hljs-keyword"&gt;if&lt;/SPAN&gt; ttft &lt;SPAN class="hljs-keyword"&gt;is&lt;/SPAN&gt; &lt;SPAN class="hljs-literal"&gt;None&lt;/SPAN&gt;:
        ttft = (now - t0) * &lt;SPAN class="hljs-number"&gt;1000&lt;/SPAN&gt;  &lt;SPAN class="hljs-comment"&gt;# ms&lt;/SPAN&gt;
    arrival_times.append(now)

&lt;SPAN class="hljs-comment"&gt;# Compute metrics&lt;/SPAN&gt;
avg_tbt_ms = (np.diff(arrival_times).mean() * &lt;SPAN class="hljs-number"&gt;1000&lt;/SPAN&gt;) &lt;SPAN class="hljs-keyword"&gt;if&lt;/SPAN&gt; &lt;SPAN class="hljs-built_in"&gt;len&lt;/SPAN&gt;(arrival_times) &amp;gt; &lt;SPAN class="hljs-number"&gt;1&lt;/SPAN&gt; &lt;SPAN class="hljs-keyword"&gt;else&lt;/SPAN&gt; &lt;SPAN class="hljs-literal"&gt;None&lt;/SPAN&gt;
stream_time_s = (arrival_times[-&lt;SPAN class="hljs-number"&gt;1&lt;/SPAN&gt;] - arrival_times[&lt;SPAN class="hljs-number"&gt;0&lt;/SPAN&gt;]) &lt;SPAN class="hljs-keyword"&gt;if&lt;/SPAN&gt; arrival_times &lt;SPAN class="hljs-keyword"&gt;else&lt;/SPAN&gt; &lt;SPAN class="hljs-literal"&gt;None&lt;/SPAN&gt;

&lt;SPAN class="hljs-built_in"&gt;print&lt;/SPAN&gt;({&lt;SPAN class="hljs-string"&gt;"ttft_ms"&lt;/SPAN&gt;: ttft, &lt;SPAN class="hljs-string"&gt;"avg_tbt_ms"&lt;/SPAN&gt;: avg_tbt_ms, &lt;SPAN class="hljs-string"&gt;"stream_time_s"&lt;/SPAN&gt;: stream_time_s})&lt;/CODE&gt;&lt;/PRE&gt;
&lt;DIV class="go8b9g3 _7pq7t62w _7pq7t6ck _7pq7t6aw _7pq7t6bm"&gt;
&lt;DIV class="go8b9g5 _7pq7t6ch"&gt;python&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;P class="qt3gz91 paragraph"&gt;Then combine with token usage from traces (if enabled) to compute tokens/sec:&lt;/P&gt;
&lt;DIV class="go8b9g1 _7pq7t6cj" data-ui-element="code-block-container"&gt;
&lt;PRE&gt;&lt;CODE class="markdown-code-python qt3gz9e hljs language-python _1ymogdh2"&gt;&lt;SPAN class="hljs-comment"&gt;# Example if you have access to the MLflow trace object for this request:&lt;/SPAN&gt;
&lt;SPAN class="hljs-comment"&gt;# tokens/sec = output_tokens / (arrival_times[-1] - arrival_times[0])&lt;/SPAN&gt;
tps = trace.info.token_usage.output_tokens / (arrival_times[-&lt;SPAN class="hljs-number"&gt;1&lt;/SPAN&gt;] - arrival_times[&lt;SPAN class="hljs-number"&gt;0&lt;/SPAN&gt;])&lt;/CODE&gt;&lt;/PRE&gt;
&lt;DIV class="go8b9g3 _7pq7t62w _7pq7t6ck _7pq7t6aw _7pq7t6bm"&gt;
&lt;DIV class="go8b9g5 _7pq7t6ch"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;DIV class="go8b9g5 _7pq7t6ch"&gt;python&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;P class="qt3gz91 paragraph"&gt;This leverages streaming with &lt;CODE class="qt3gz9f"&gt;responses&lt;/CODE&gt; and MLflow Tracing’s token usage fields.&lt;/P&gt;
&lt;H4 class="_7uu25p0 qt3gz9c _7pq7t612 heading4 _7uu25p1"&gt;Databricks OpenAI client (legacy chat.completions) — compute TTFT and avg TBT&lt;/H4&gt;
&lt;DIV class="go8b9g1 _7pq7t6cj" data-ui-element="code-block-container"&gt;
&lt;PRE&gt;&lt;CODE class="markdown-code-python qt3gz9e hljs language-python _1ymogdh2"&gt;&lt;SPAN class="hljs-keyword"&gt;import&lt;/SPAN&gt; time
&lt;SPAN class="hljs-keyword"&gt;import&lt;/SPAN&gt; numpy &lt;SPAN class="hljs-keyword"&gt;as&lt;/SPAN&gt; np
&lt;SPAN class="hljs-keyword"&gt;from&lt;/SPAN&gt; openai &lt;SPAN class="hljs-keyword"&gt;import&lt;/SPAN&gt; OpenAI

client = OpenAI()
endpoint = &lt;SPAN class="hljs-string"&gt;"&amp;lt;your-agent-endpoint&amp;gt;"&lt;/SPAN&gt;
messages = [{&lt;SPAN class="hljs-string"&gt;"role"&lt;/SPAN&gt;: &lt;SPAN class="hljs-string"&gt;"user"&lt;/SPAN&gt;, &lt;SPAN class="hljs-string"&gt;"content"&lt;/SPAN&gt;: &lt;SPAN class="hljs-string"&gt;"Explain MLflow Tracing"&lt;/SPAN&gt;}]

t0 = time.perf_counter()
ttft = &lt;SPAN class="hljs-literal"&gt;None&lt;/SPAN&gt;
arrival_times = []

stream = client.chat.completions.create(model=endpoint, messages=messages, stream=&lt;SPAN class="hljs-literal"&gt;True&lt;/SPAN&gt;)
&lt;SPAN class="hljs-keyword"&gt;for&lt;/SPAN&gt; chunk &lt;SPAN class="hljs-keyword"&gt;in&lt;/SPAN&gt; stream:
    now = time.perf_counter()
    &lt;SPAN class="hljs-keyword"&gt;if&lt;/SPAN&gt; ttft &lt;SPAN class="hljs-keyword"&gt;is&lt;/SPAN&gt; &lt;SPAN class="hljs-literal"&gt;None&lt;/SPAN&gt;:
        ttft = (now - t0) * &lt;SPAN class="hljs-number"&gt;1000&lt;/SPAN&gt;  &lt;SPAN class="hljs-comment"&gt;# ms&lt;/SPAN&gt;
    arrival_times.append(now)

avg_tbt_ms = (np.diff(arrival_times).mean() * &lt;SPAN class="hljs-number"&gt;1000&lt;/SPAN&gt;) &lt;SPAN class="hljs-keyword"&gt;if&lt;/SPAN&gt; &lt;SPAN class="hljs-built_in"&gt;len&lt;/SPAN&gt;(arrival_times) &amp;gt; &lt;SPAN class="hljs-number"&gt;1&lt;/SPAN&gt; &lt;SPAN class="hljs-keyword"&gt;else&lt;/SPAN&gt; &lt;SPAN class="hljs-literal"&gt;None&lt;/SPAN&gt;
&lt;SPAN class="hljs-built_in"&gt;print&lt;/SPAN&gt;({&lt;SPAN class="hljs-string"&gt;"ttft_ms"&lt;/SPAN&gt;: ttft, &lt;SPAN class="hljs-string"&gt;"avg_tbt_ms"&lt;/SPAN&gt;: avg_tbt_ms})&lt;/CODE&gt;&lt;/PRE&gt;
&lt;DIV class="go8b9g3 _7pq7t62w _7pq7t6ck _7pq7t6aw _7pq7t6bm"&gt;
&lt;DIV class="go8b9g5 _7pq7t6ch"&gt;python&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;P class="qt3gz91 paragraph"&gt;Use this for existing &lt;CODE class="qt3gz9f"&gt;chat.completions&lt;/CODE&gt; integrations; new agents should prefer &lt;CODE class="qt3gz9f"&gt;responses&lt;/CODE&gt;.&lt;/P&gt;
&lt;H4 class="_7uu25p0 qt3gz9c _7pq7t612 heading4 _7uu25p1"&gt;Logging TTFT/TBT within a ResponsesAgent implementation&lt;/H4&gt;
&lt;P class="qt3gz91 paragraph"&gt;If you author agents in code, you can instrument your &lt;CODE class="qt3gz9f"&gt;predict_stream&lt;/CODE&gt; to compute and log TTFT/TBT and let &lt;STRONG&gt;MLflow Tracing&lt;/STRONG&gt; aggregate the full output for you.&lt;/P&gt;
&lt;DIV class="go8b9g1 _7pq7t6cj" data-ui-element="code-block-container"&gt;
&lt;PRE&gt;&lt;CODE class="markdown-code-python qt3gz9e hljs language-python _1ymogdh2"&gt;&lt;SPAN class="hljs-keyword"&gt;import&lt;/SPAN&gt; time
&lt;SPAN class="hljs-keyword"&gt;import&lt;/SPAN&gt; mlflow
&lt;SPAN class="hljs-keyword"&gt;from&lt;/SPAN&gt; databricks.agents &lt;SPAN class="hljs-keyword"&gt;import&lt;/SPAN&gt; ResponsesAgent, ResponsesAgentStreamEvent

&lt;SPAN class="hljs-keyword"&gt;class&lt;/SPAN&gt; &lt;SPAN class="hljs-title class_"&gt;MyAgent&lt;/SPAN&gt;(&lt;SPAN class="hljs-title class_ inherited__"&gt;ResponsesAgent&lt;/SPAN&gt;):
    &lt;SPAN class="hljs-keyword"&gt;def&lt;/SPAN&gt; &lt;SPAN class="hljs-title function_"&gt;predict_stream&lt;/SPAN&gt;(&lt;SPAN class="hljs-params"&gt;self, request&lt;/SPAN&gt;):
        t0 = time.perf_counter()
        first_time = &lt;SPAN class="hljs-literal"&gt;None&lt;/SPAN&gt;
        arrival_times = []

        &lt;SPAN class="hljs-keyword"&gt;for&lt;/SPAN&gt; stream_event &lt;SPAN class="hljs-keyword"&gt;in&lt;/SPAN&gt; &lt;SPAN class="hljs-variable language_"&gt;self&lt;/SPAN&gt;.agent.stream(request.&lt;SPAN class="hljs-built_in"&gt;input&lt;/SPAN&gt;):  &lt;SPAN class="hljs-comment"&gt;# stream chunks from your underlying LLM/toolchain&lt;/SPAN&gt;
            now = time.perf_counter()
            &lt;SPAN class="hljs-keyword"&gt;if&lt;/SPAN&gt; first_time &lt;SPAN class="hljs-keyword"&gt;is&lt;/SPAN&gt; &lt;SPAN class="hljs-literal"&gt;None&lt;/SPAN&gt;:
                first_time = now
                mlflow.log_metric(&lt;SPAN class="hljs-string"&gt;"ttft_ms"&lt;/SPAN&gt;, (first_time - t0) * &lt;SPAN class="hljs-number"&gt;1000&lt;/SPAN&gt;)

            arrival_times.append(now)
            &lt;SPAN class="hljs-keyword"&gt;yield&lt;/SPAN&gt; stream_event  &lt;SPAN class="hljs-comment"&gt;# forward the delta events&lt;/SPAN&gt;

        &lt;SPAN class="hljs-keyword"&gt;if&lt;/SPAN&gt; &lt;SPAN class="hljs-built_in"&gt;len&lt;/SPAN&gt;(arrival_times) &amp;gt; &lt;SPAN class="hljs-number"&gt;1&lt;/SPAN&gt;:
            avg_tbt_ms = (&lt;SPAN class="hljs-built_in"&gt;sum&lt;/SPAN&gt;(arrival_times[i] - arrival_times[i-&lt;SPAN class="hljs-number"&gt;1&lt;/SPAN&gt;] &lt;SPAN class="hljs-keyword"&gt;for&lt;/SPAN&gt; i &lt;SPAN class="hljs-keyword"&gt;in&lt;/SPAN&gt; &lt;SPAN class="hljs-built_in"&gt;range&lt;/SPAN&gt;(&lt;SPAN class="hljs-number"&gt;1&lt;/SPAN&gt;, &lt;SPAN class="hljs-built_in"&gt;len&lt;/SPAN&gt;(arrival_times))) / (&lt;SPAN class="hljs-built_in"&gt;len&lt;/SPAN&gt;(arrival_times)-&lt;SPAN class="hljs-number"&gt;1&lt;/SPAN&gt;)) * &lt;SPAN class="hljs-number"&gt;1000&lt;/SPAN&gt;
            mlflow.log_metric(&lt;SPAN class="hljs-string"&gt;"avg_tbt_ms"&lt;/SPAN&gt;, avg_tbt_ms)&lt;/CODE&gt;&lt;/PRE&gt;
&lt;DIV class="go8b9g3 _7pq7t62w _7pq7t6ck _7pq7t6aw _7pq7t6bm"&gt;
&lt;DIV class="go8b9g5 _7pq7t6ch"&gt;python&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;P class="qt3gz91 paragraph"&gt;Streaming events are aggregated for display and tracing; this pattern lets you attach your own metrics cleanly.&lt;/P&gt;
&lt;H3 class="_7uu25p0 qt3gz9c _7pq7t612 heading3 _7uu25p1"&gt;Server-side observability and aggregation&lt;/H3&gt;
&lt;UL class="qt3gz97 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;Enable &lt;STRONG&gt;MLflow Tracing in production&lt;/STRONG&gt; when deploying with Agent Framework; traces are logged to MLflow experiments and can be synced to Delta tables for monitoring. You’ll get latency, errors, and token usage, and can compute downstream analytics. TTFT/TBT remain client-derived unless you create custom attributes/spans for chunk timings.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;Use &lt;STRONG&gt;Agent Evaluation (MLflow 2)&lt;/STRONG&gt; to see aggregated latency metrics over an evaluation set, and add &lt;STRONG&gt;custom metrics&lt;/STRONG&gt; if you want TTFT/TBT computed from traces you collect during streaming.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3 class="_7uu25p0 qt3gz9c _7pq7t612 heading3 _7uu25p1"&gt;Guidance&lt;/H3&gt;
&lt;UL class="qt3gz97 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;Prefer the &lt;STRONG&gt;Databricks OpenAI &lt;CODE class="qt3gz9f"&gt;responses&lt;/CODE&gt; client&lt;/STRONG&gt; for new agents and streaming metrics work; keep &lt;CODE class="qt3gz9f"&gt;chat.completions&lt;/CODE&gt; only for legacy code paths.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;Use &lt;STRONG&gt;MLflow Tracing&lt;/STRONG&gt; everywhere so you can correlate your client-side TTFT/TBT with server-side spans, token usage, and cost; export traces via OpenTelemetry if you need them in your existing observability stack.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;If you need these metrics in dashboards, &lt;STRONG&gt;log them via &lt;CODE class="qt3gz9f"&gt;mlflow.log_metric&lt;/CODE&gt;&lt;/STRONG&gt; in your agent code and roll up by endpoint/version over time; combine with token usage for throughput and cost analyses.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
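&lt;P class="qt3gz91 paragraph"&gt;For the dashboard roll-up in the last point, a sketch (names are illustrative) of aggregating a window of per-request TTFT samples into percentiles with only the standard library:&lt;/P&gt;

```python
# Hypothetical roll-up: summarize per-request TTFT samples (ms) for a dashboard.
import statistics
from typing import Dict, List


def ttft_summary(samples_ms: List[float]) -> Dict[str, float]:
    """p50/p95/mean over a window of per-request TTFT measurements."""
    pct = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "ttft_p50_ms": pct[49],    # 50th percentile
        "ttft_p95_ms": pct[94],    # 95th percentile
        "ttft_mean_ms": statistics.fmean(samples_ms),
    }
```

&lt;P class="qt3gz91 paragraph"&gt;Log the summary per window (e.g., per endpoint/version per hour) via &lt;CODE class="qt3gz9f"&gt;mlflow.log_metric&lt;/CODE&gt; to get time-series views.&lt;/P&gt;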
&lt;DIV class="paragraph"&gt;Hope this helps, Louis.&lt;/DIV&gt;</description>
      <pubDate>Sat, 08 Nov 2025 21:45:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/generative-ai/measuring-latency-metrics-like-ttft-tbt-when-deploying-agents-on/m-p/138228#M1361</guid>
      <dc:creator>Louis_Frolio</dc:creator>
      <dc:date>2025-11-08T21:45:45Z</dc:date>
    </item>
    <item>
      <title>Re: Measuring latency metrics like TTFT, TBT when deploying agents on Databricks</title>
      <link>https://community.databricks.com/t5/generative-ai/measuring-latency-metrics-like-ttft-tbt-when-deploying-agents-on/m-p/138338#M1367</link>
      <description>&lt;P&gt;Thanks for your response Louis. If I understand it correctly, for production monitoring, we would have to rely on client side logging. Can mlflow.log_metric be integrated with traces by any chance? (Since that seems to be the only way to measure TTFT on agent/server side)&lt;/P&gt;</description>
      <pubDate>Mon, 10 Nov 2025 06:15:20 GMT</pubDate>
      <guid>https://community.databricks.com/t5/generative-ai/measuring-latency-metrics-like-ttft-tbt-when-deploying-agents-on/m-p/138338#M1367</guid>
      <dc:creator>Rajat-TVSM</dc:creator>
      <dc:date>2025-11-10T06:15:20Z</dc:date>
    </item>
    <item>
      <title>Re: Measuring latency metrics like TTFT, TBT when deploying agents on Databricks</title>
      <link>https://community.databricks.com/t5/generative-ai/measuring-latency-metrics-like-ttft-tbt-when-deploying-agents-on/m-p/139713#M1427</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/160119"&gt;@Rajat-TVSM&lt;/a&gt;&amp;nbsp;,&amp;nbsp;&lt;/P&gt;
&lt;P class="qt3gz91 paragraph"&gt;You’re close: for production observability you can use server‑side tracing when you deploy agents on Databricks, and client‑side instrumentation when the app runs outside; you don’t have to rely only on the client side.&lt;/P&gt;
&lt;H3 class="_7uu25p0 qt3gz9c _7pq7t612 heading3 _7uu25p1"&gt;What to use in production&lt;/H3&gt;
&lt;UL class="qt3gz97 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;&lt;STRONG&gt;Databricks‑hosted agents (Agent Framework / Model Serving):&lt;/STRONG&gt; Traces are logged automatically to your MLflow experiment, and can be synced to Delta tables for monitoring and analysis. No extra client logging is required beyond enabling tracing in the agent code or endpoint config.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;&lt;STRONG&gt;Apps deployed outside Databricks:&lt;/STRONG&gt; Use the lightweight mlflow‑tracing SDK to instrument your code and send traces to the Databricks MLflow server. This is the recommended pattern for “agent/server‑side” observability of external services.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;&lt;STRONG&gt;What tracing captures out of the box:&lt;/STRONG&gt; Inputs/outputs, intermediate steps (spans), token usage, and latency at each step, which is often enough for performance monitoring and cost tracking.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3 class="_7uu25p0 qt3gz9c _7pq7t612 heading3 _7uu25p1"&gt;Can mlflow.log_metric be “integrated with” traces?&lt;/H3&gt;
&lt;UL class="qt3gz97 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;&lt;STRONG&gt;Short answer:&lt;/STRONG&gt; mlflow.log_metric writes metrics to the active MLflow Run; &lt;STRONG&gt;traces&lt;/STRONG&gt; are a separate object with their own metadata and attributes. The recommended pattern is to log operational values (like TTFT) as a &lt;STRONG&gt;trace/span attribute&lt;/STRONG&gt; for per‑request analysis, and optionally also log the same value as an MLflow &lt;STRONG&gt;run metric&lt;/STRONG&gt; for aggregate dashboards.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;&lt;STRONG&gt;Attaching data to traces:&lt;/STRONG&gt; Use &lt;CODE class="qt3gz9f"&gt;mlflow.update_current_trace(...)&lt;/CODE&gt; for trace‑level metadata/tags, or &lt;CODE class="qt3gz9f"&gt;span.set_attribute(...)&lt;/CODE&gt; for span‑level values (for example, custom latency counters).&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3 class="_7uu25p0 qt3gz9c _7pq7t612 heading3 _7uu25p1"&gt;Measuring TTFT (time‑to‑first‑token)&lt;/H3&gt;
&lt;P class="qt3gz91 paragraph"&gt;TTFT is specific to streaming. You can compute it in your server code the moment the first token is emitted and attach it to the current span, then optionally mirror it to run metrics:&lt;/P&gt;
&lt;DIV class="go8b9g1 _7pq7t6cl" data-ui-element="code-block-container"&gt;
&lt;PRE&gt;&lt;CODE class="markdown-code-python qt3gz9e hljs language-python _1ymogdh2"&gt;&lt;SPAN class="hljs-keyword"&gt;import&lt;/SPAN&gt; time
&lt;SPAN class="hljs-keyword"&gt;import&lt;/SPAN&gt; mlflow
&lt;SPAN class="hljs-keyword"&gt;from&lt;/SPAN&gt; mlflow.entities &lt;SPAN class="hljs-keyword"&gt;import&lt;/SPAN&gt; SpanType

mlflow.set_experiment(&lt;SPAN class="hljs-string"&gt;"/Shared/prod-agent"&lt;/SPAN&gt;)
mlflow.openai.autolog()  &lt;SPAN class="hljs-comment"&gt;# or your provider autolog&lt;/SPAN&gt;

&lt;SPAN class="hljs-meta"&gt;@mlflow.trace(&lt;SPAN class="hljs-params"&gt;span_type=SpanType.CHAIN&lt;/SPAN&gt;)&lt;/SPAN&gt;
&lt;SPAN class="hljs-keyword"&gt;def&lt;/SPAN&gt; &lt;SPAN class="hljs-title function_"&gt;chat_stream&lt;/SPAN&gt;(&lt;SPAN class="hljs-params"&gt;prompt: &lt;SPAN class="hljs-built_in"&gt;str&lt;/SPAN&gt;&lt;/SPAN&gt;):
    span = mlflow.get_current_active_span()
    t0 = time.time()  &lt;SPAN class="hljs-comment"&gt;# request start&lt;/SPAN&gt;

    full_text = []
    first_token_ms = &lt;SPAN class="hljs-literal"&gt;None&lt;/SPAN&gt;

    &lt;SPAN class="hljs-keyword"&gt;for&lt;/SPAN&gt; token &lt;SPAN class="hljs-keyword"&gt;in&lt;/SPAN&gt; stream_llm_tokens(prompt):  &lt;SPAN class="hljs-comment"&gt;# your streaming call&lt;/SPAN&gt;
        full_text.append(token)
        &lt;SPAN class="hljs-keyword"&gt;if&lt;/SPAN&gt; first_token_ms &lt;SPAN class="hljs-keyword"&gt;is&lt;/SPAN&gt; &lt;SPAN class="hljs-literal"&gt;None&lt;/SPAN&gt;:
            first_token_ms = &lt;SPAN class="hljs-built_in"&gt;int&lt;/SPAN&gt;((time.time() - t0) * &lt;SPAN class="hljs-number"&gt;1000&lt;/SPAN&gt;)
            &lt;SPAN class="hljs-comment"&gt;# Attach to the trace/span for per-request analysis&lt;/SPAN&gt;
            span.set_attribute(&lt;SPAN class="hljs-string"&gt;"llm.ttft_ms"&lt;/SPAN&gt;, first_token_ms)
            &lt;SPAN class="hljs-comment"&gt;# Optionally log to the active run for aggregation&lt;/SPAN&gt;
            mlflow.log_metric(&lt;SPAN class="hljs-string"&gt;"ttft_ms"&lt;/SPAN&gt;, first_token_ms)

    &lt;SPAN class="hljs-keyword"&gt;return&lt;/SPAN&gt; &lt;SPAN class="hljs-string"&gt;""&lt;/SPAN&gt;.join(full_text)&lt;/CODE&gt;&lt;/PRE&gt;
&lt;DIV class="go8b9g3 _7pq7t62y _7pq7t6cm _7pq7t6ay _7pq7t6bo"&gt;
&lt;DIV class="_17yk06p0"&gt;&lt;SPAN&gt;This follows the same approach as MLflow’s streaming examples (capture time context and update the active trace), and pairs with MLflow’s trace timing properties for overall latency.&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;H3 class="_7uu25p0 qt3gz9c _7pq7t612 heading3 _7uu25p1"&gt;Querying TTFT later&lt;/H3&gt;
&lt;UL class="qt3gz97 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;&lt;STRONG&gt;Per‑request:&lt;/STRONG&gt; Search traces by your custom attribute, e.g., &lt;CODE class="qt3gz9f"&gt;attributes.\&lt;/CODE&gt;llm.ttft_ms` &amp;gt; 500` to find slow first‑token cases.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;&lt;STRONG&gt;Aggregate:&lt;/STRONG&gt; Use the run metric &lt;CODE class="qt3gz9f"&gt;ttft_ms&lt;/CODE&gt; for dashboards or time‑series views alongside other metrics.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3 class="_7uu25p0 qt3gz9c _7pq7t612 heading3 _7uu25p1"&gt;Related capabilities you may want&lt;/H3&gt;
&lt;UL class="qt3gz97 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;&lt;STRONG&gt;Automatic production tracing for Databricks deployments&lt;/STRONG&gt; (Agent Framework / Model Serving) with optional sync to Delta tables for monitoring pipelines.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;&lt;STRONG&gt;Token usage and latency tracking&lt;/STRONG&gt; built into traces; compute cost from tokens.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;P class="qt3gz91 paragraph"&gt;&lt;STRONG&gt;Lakehouse Monitoring for GenAI&lt;/STRONG&gt; and evaluation pipelines consume traces to compute operational and quality metrics at scale.&lt;/P&gt;
&lt;P&gt;Hope this helps, Louis.&lt;/P&gt;</description>
      <pubDate>Wed, 19 Nov 2025 18:55:26 GMT</pubDate>
      <guid>https://community.databricks.com/t5/generative-ai/measuring-latency-metrics-like-ttft-tbt-when-deploying-agents-on/m-p/139713#M1427</guid>
      <dc:creator>Louis_Frolio</dc:creator>
      <dc:date>2025-11-19T18:55:26Z</dc:date>
    </item>
  </channel>
</rss>

