Hi, I’m Debu. I spend a lot of my day building and stress‑testing LLM‑powered systems, and one lesson keeps coming back: if you don’t measure your agent’s behavior over an entire conversation, you’re flying blind. Below is the exact notebook pattern I use on Databricks to score a chatbot’s performance turn‑by‑turn and track those numbers over time.
Real users don’t stop after one message. They change topics, ask follow‑ups, and expect the bot to remember context. A single‑turn test can’t surface issues like dropped context, contradictory answers to follow‑ups, or clumsy topic switches.
That’s why every example you’ll see here treats the conversation as a list of messages—not isolated prompts.
I’m using DBR 14.3 LTS and MLflow ≥ 2.14 (the @mlflow.trace decorator below needs the tracing API). Install the extras and restart Python so Databricks picks them up:
%pip install databricks-sdk databricks-agents
dbutils.library.restartPython()
Then pull in the usual suspects:
import mlflow
from mlflow.deployments import get_deploy_client # optional for prod deploys
import pandas as pd
Each row holds the full conversation so far plus the grading guidelines for the next response:
eval_set = [
    {
        "request": {
            "messages": [
                {"role": "user", "content": "Hi"},
                {"role": "assistant", "content": "Hello! How can I help you today?"},
                {"role": "user", "content": "Tell me a joke"}
            ]
        },
        "guidelines": [
            "The response should be humorous but appropriate",
            "The response should be concise"
        ]
    },
    {
        "request": {
            "messages": [
                {"role": "user", "content": "What's the weather like?"},
                {"role": "assistant", "content": "I don't have real-time weather data. You'd need to check a weather service for that information."},
                {"role": "user", "content": "Can you explain how LLMs work?"}
            ]
        },
        "guidelines": [
            "The response should be technical but accessible",
            "The response should include a brief explanation of attention mechanisms"
        ]
    }
]
# Convert to a DataFrame so mlflow.evaluate can treat it as a table‑like object
eval_df = pd.DataFrame(eval_set)
I keep it in Pandas because it’s easy to version and quick to inspect.
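Before wiring it into an evaluation, I like to eyeball what is actually in each row. A minimal, purely illustrative sketch:
# Print the last user turn and the guidelines for every row in the eval set
for i, row in eval_df.iterrows():
    last_user = [m["content"] for m in row["request"]["messages"] if m["role"] == "user"][-1]
    print(f"Row {i}: last user turn={last_user!r}, guidelines={row['guidelines']}")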
Here’s a throw‑away rule‑based agent. The only thing that matters is the function signature—messages comes in as the full dialog.
@mlflow.trace(span_type="AGENT")
def my_agent(messages):
    """A trivial rule-based agent for illustration."""
    last_user_message = next((m["content"] for m in reversed(messages) if m["role"] == "user"), "")
    if "joke" in last_user_message.lower():
        return "Why did the AI go to art school? To learn how to draw conclusions!"
    elif "weather" in last_user_message.lower():
        return "I don't have access to real-time weather data, but I can help you understand weather patterns in general."
    elif any(term in last_user_message.lower() for term in ["llm", "language model"]):
        return (
            "Large Language Models (LLMs) are AI systems trained on vast amounts of text data. "
            "They use transformer architectures with attention mechanisms to model relationships between tokens."
        )
    else:
        return f"I understand you asked about: '{last_user_message}'. How can I help with that?"
mlflow.trace gives me latency and nested‑call traces for free—handy once I replace this with a real chain‑of‑thought or RAG pipeline.
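If you want to poke at those traces programmatically rather than in the Tracing UI, here’s a minimal sketch. It assumes MLflow 2.14+ tracing; the exact columns returned by mlflow.search_traces vary by version, so inspect them before relying on any one name:
# Invoke the agent once so a trace gets recorded
my_agent([{"role": "user", "content": "Tell me a joke"}])

# Pull recent traces into a pandas DataFrame (latency typically lives in execution_time_ms)
traces = mlflow.search_traces(max_results=5)
print(traces.columns.tolist())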
I add a second layer of checks that apply to every row:
global_guidelines = {
    "helpfulness": ["The response must be helpful and directly address the user's question"],
    "clarity": ["The response must be clear and well-structured"],
    "safety": ["The response must be safe and appropriate"]
}
Now let’s grade the agent. Everything lives inside an MLflow run so I can track it later:
with mlflow.start_run(run_name="agent_evaluation_v1") as run:
    evaluation_results = mlflow.evaluate(
        data=eval_df,
        model=lambda request: my_agent(**request),
        model_type="databricks-agent",
        evaluator_config={
            "databricks-agent": {
                "global_guidelines": global_guidelines
            }
        }
    )
Under the hood, Databricks calls its proprietary LLM judges and grades each response against every guideline, so the results include ratings for the helpfulness, clarity, and safety checks defined above as well as the per-row guidelines from the eval set.
print("Aggregated metrics:\n", evaluation_results.metrics)
per_request_results = evaluation_results.tables["eval_results"]
print("\nPer‑request results:\n", per_request_results)
Need a quick visual? In a notebook just run:
display(per_request_results)
Aggregates catch regressions; per‑request rows tell me exactly which turn broke the guideline.
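When a run regresses, I filter the per-request table down to the rows that failed. The column naming below is an assumption (judge verdict columns ending in /rating with yes/no values); confirm against per_request_results.columns in your workspace before using it:
# Hypothetical filter: keep rows where any LLM-judge rating column is not "yes"
rating_cols = [c for c in per_request_results.columns if c.endswith("/rating")]
failing = per_request_results[(per_request_results[rating_cols] != "yes").any(axis=1)]
display(failing)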
I push every run into Delta so I can chart trends and set alerts:
def append_metrics_to_table(run_name, mlflow_metrics, delta_table_name):
    data = {k: v for k, v in mlflow_metrics.items() if "error_count" not in k}
    data.update({"run_name": run_name, "timestamp": pd.Timestamp.now()})
    (spark.createDataFrame([data])
        .write.mode("append")
        .saveAsTable(delta_table_name))
# append_metrics_to_table("agent_evaluation_v1", evaluation_results.metrics, "catalog.schema.agent_eval_results")
Hook this into DLT or a Databricks SQL dashboard and you’ve got continuous monitoring.
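As a starting point for that dashboard, here’s a sketch of the trend query, assuming the table name from the commented-out call above:
# Pull every logged run in time order; switch the notebook visualization to a line chart
trend = spark.sql("""
    SELECT *
    FROM catalog.schema.agent_eval_results
    ORDER BY timestamp
""")
display(trend)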
You now have a repeatable, multi‑turn evaluation loop: every response is graded against per‑row and global guidelines, every run is tracked in MLflow, and every score lands in Delta where you can watch the trend.
Embed these checks early and your agents will improve with every release—no surprises when they hit real users. If you have questions or tweaks, ping me anytime. Happy shipping!