Hi, I’m Debu. I spend a lot of my day building and stress‑testing LLM‑powered systems, and one lesson keeps coming back: if you don’t measure your agent’s behavior over an entire conversation, you’re flying blind. Below is the exact notebook pattern I use on Databricks to score a chatbot’s performance turn‑by‑turn and track those numbers over time.

Why Multi‑Turn Evaluation?

Real users don’t stop after one message. They change topics, ask follow‑ups, and expect the bot to remember context. A single‑turn test can’t surface issues like:

  • losing track of earlier instructions,

  • contradicting itself three turns later, or

  • drifting into unsafe territory when the dialog gets longer.

That’s why every example you’ll see here treats the conversation as a list of messages—not isolated prompts.

What You’ll Build

  1. Set up the notebook environment with the databricks-sdk and databricks-agents libraries.

  2. Create a small multi‑turn eval set—two dialogs, each with its own rubric.

  3. Write a tiny rule‑based agent (swap it for your real model later) and wrap it in mlflow.trace.

  4. Define global guidelines for helpfulness, clarity, and safety.

  5. Run mlflow.evaluate with the built‑in Databricks agent grader.

  6. Store the scores in Delta so you can watch trends and catch regressions.

1  — Prerequisites & Environment Setup

I’m using DBR 14.3 LTS and MLflow ≥ 2.14 (the mlflow.trace API used below relies on MLflow’s tracing support). Install the libraries and restart Python so Databricks picks them up:

%pip install databricks-sdk databricks-agents
dbutils.library.restartPython()

Then pull in the usual suspects:

import mlflow
from mlflow.deployments import get_deploy_client  # optional for prod deploys
import pandas as pd

2  — Crafting a Multi‑Turn Evaluation Dataset

Each row holds the full conversation so far plus the grading guidelines for the next response:

eval_set = [
    {
        "request": {
            "messages": [
                {"role": "user", "content": "Hi"},
                {"role": "assistant", "content": "Hello! How can I help you today?"},
                {"role": "user", "content": "Tell me a joke"}
            ]
        },
        "guidelines": [
            "The response should be humorous but appropriate",
            "The response should be concise"
        ]
    },
    {
        "request": {
            "messages": [
                {"role": "user", "content": "What's the weather like?"},
                {"role": "assistant", "content": "I don't have real‑time weather data. You'd need to check a weather service for that information."},
                {"role": "user", "content": "Can you explain how LLMs work?"}
            ]
        },
        "guidelines": [
            "The response should be technical but accessible",
            "The response should include a brief explanation of attention mechanisms"
        ]
    }
]

# Convert to a DataFrame so mlflow.evaluate can treat it as a table‑like object
eval_df = pd.DataFrame(eval_set)

I keep it in Pandas because it’s easy to version and quick to inspect.
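
A quick spot-check loop (purely illustrative, using only the eval_set defined above) prints the last user turn alongside its rubric for each row:

for i, row in enumerate(eval_set):
    # Pull the most recent user message from the conversation history
    last_user = next(
        m["content"] for m in reversed(row["request"]["messages"]) if m["role"] == "user"
    )
    print(f"Row {i}: last user turn = {last_user!r}")
    for guideline in row["guidelines"]:
        print(f"  guideline: {guideline}")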

3  — Implementing a Simple Agent (Demo Only)

Here’s a throw‑away rule‑based agent. The only thing that matters is the function signature: messages comes in as the full dialog.

@mlflow.trace(span_type="AGENT")
def my_agent(messages):
    """A trivial rule‑based agent for illustration."""
    last_user_message = next((m["content"] for m in reversed(messages) if m["role"] == "user"), "")

    if "joke" in last_user_message.lower():
        return "Why did the AI go to art school? To learn how to draw conclusions!"
    elif "weather" in last_user_message.lower():
        return "I don't have access to real‑time weather data, but I can help you understand weather patterns in general."
    elif any(term in last_user_message.lower() for term in ["llm", "language model"]):
        return (
            "Large Language Models (LLMs) are AI systems trained on vast amounts of text data. "
            "They use transformer architectures with attention mechanisms to model relationships between tokens."
        )
    else:
        return f"I understand you asked about: '{last_user_message}'. How can I help with that?"

mlflow.trace gives me latency and nested‑call traces for free—handy once I replace this with a real chain‑of‑thought or RAG pipeline.
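
If you’re curious what those nested traces look like once the agent has real sub-steps, here’s a minimal sketch. The retrieve_context helper is hypothetical (a stand-in for your vector search or retriever call); decorating it with mlflow.trace makes it show up as a child span under the agent:

@mlflow.trace(span_type="RETRIEVER")
def retrieve_context(query: str) -> list:
    """Hypothetical retrieval step -- replace with your real vector search call."""
    return [f"doc snippet related to: {query}"]

@mlflow.trace(span_type="AGENT")
def my_rag_agent(messages):
    """Same signature as my_agent, but with a traced sub-call."""
    last_user_message = next((m["content"] for m in reversed(messages) if m["role"] == "user"), "")
    context = retrieve_context(last_user_message)  # appears as a nested span in the trace UI
    return f"Based on {len(context)} retrieved snippet(s): {context[0]}"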

4  — Global Guidelines

I add a second layer of checks that apply to every row:

global_guidelines = {
    "helpfulness": ["The response must be helpful and directly address the user's question"],
    "clarity": ["The response must be clear and well‑structured"],
    "safety": ["The response must be safe and appropriate"]
}

5  — Running the Evaluation

Now let’s grade the agent. Everything lives inside an MLflow run so I can track it later:

with mlflow.start_run(run_name="agent_evaluation_v1") as run:
    evaluation_results = mlflow.evaluate(
        data=eval_df,
        model=lambda request: my_agent(**request),
        model_type="databricks-agent",
        evaluator_config={
            "databricks-agent": {
                "global_guidelines": global_guidelines
            }
        }
    )

Under the hood, Databricks calls proprietary expert LLM judges and returns scores like helpfulness_score, clarity_score, and safety_score.

6  — Inspecting the Results

print("Aggregated metrics:\n", evaluation_results.metrics)
per_request_results = evaluation_results.tables["eval_results"]
print("\nPer‑request results:\n", per_request_results)

Need a quick visual? In a notebook just run:

display(per_request_results)

Aggregates catch regressions; per‑request rows tell me exactly which turn broke the guideline.
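
When something does fail, you can filter the per-request table down to the offending rows. The rating column names vary across agent-evaluation releases, so treat the pattern below as an assumption and check per_request_results.columns first:

# Assumption: judge verdicts land in columns whose names contain "rating",
# with "no" marking a failed check -- verify against your own results table.
rating_cols = [c for c in per_request_results.columns if "rating" in c.lower()]
failed_rows = per_request_results[
    per_request_results[rating_cols].apply(lambda row: (row == "no").any(), axis=1)
]
display(failed_rows)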

7  — Persisting Metrics to Delta

I push every run into Delta so I can chart trends and set alerts:

def append_metrics_to_table(run_name, mlflow_metrics, delta_table_name):
    data = {k: v for k, v in mlflow_metrics.items() if "error_count" not in k}
    data.update({"run_name": run_name, "timestamp": pd.Timestamp.now()})

    (spark.createDataFrame([data])
        .write.mode("append")
        .saveAsTable(delta_table_name))

# append_metrics_to_table("agent_evaluation_v1", evaluation_results.metrics, "catalog.schema.agent_eval_results")

Hook this into DLT or a Databricks SQL dashboard and you’ve got continuous monitoring.
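
As a starting point for that dashboard, here’s a minimal trend query. The table name matches the commented call above; adjust it to your own catalog and schema:

# Assumed table name from the commented append_metrics_to_table call above.
history = spark.table("catalog.schema.agent_eval_results")
display(history.orderBy("timestamp"))

# The same query works as a Databricks SQL dashboard source:
# SELECT * FROM catalog.schema.agent_eval_results ORDER BY timestamp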

Wrap‑Up & Next Steps

You now have a repeatable, multi‑turn evaluation loop that is:

  • Reproducible – every run is logged in MLflow.

  • Swappable – replace the toy agent with your production model; the harness stays the same.

  • Observable – metrics and traces live in Delta and MLflow for long‑term visibility.

Where I go from here

  • Expand the eval set with adversarial prompts and longer chats.

  • Track latency, token usage, and cost alongside quality scores.

  • Wire this into CI/CD—block model promotion if helpfulness drops (see the sketch after this list).

  • Store everything in Unity Catalog for governance.
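
For the CI/CD gate mentioned above, a minimal sketch could look like the following. The metric-name pattern and threshold are assumptions; match them to the keys that actually show up in evaluation_results.metrics in your workspace:

# Hypothetical promotion gate: fail the job if helpfulness regresses.
HELPFULNESS_THRESHOLD = 0.8  # assumed threshold -- tune for your use case

helpfulness_keys = [k for k in evaluation_results.metrics if "helpfulness" in k.lower()]
assert helpfulness_keys, "No helpfulness metric found; check evaluation_results.metrics"

score = evaluation_results.metrics[helpfulness_keys[0]]
if score < HELPFULNESS_THRESHOLD:
    raise ValueError(
        f"Helpfulness {score:.2f} is below {HELPFULNESS_THRESHOLD}; blocking model promotion."
    )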

Embed these checks early and your agents will improve with every release—no surprises when they hit real users. If you have questions or tweaks, ping me anytime. Happy shipping!