Hi, I’m Debu. I spend a lot of my day building and stress‑testing LLM‑powered systems, and one lesson keeps coming back: if you don’t measure your agent’s behavior over an entire conversation, you’re flying blind. Below is the exact notebook pattern I use on Databricks to score a chatbot’s performance turn‑by‑turn and track those numbers over time.
Real users don’t stop after one message. They change topics, ask follow‑ups, and expect the bot to remember context. A single‑turn test can’t surface issues like dropped context, contradictory answers to follow‑ups, or clumsy topic switches.
That’s why every example you’ll see here treats the conversation as a list of messages—not isolated prompts.
I’m using DBR 14.3 LTS and MLflow ≥ 2.14 (the @mlflow.trace decorator below needs the tracing API). Install the extras and restart Python so Databricks picks them up:
%pip install databricks-sdk databricks-agents
dbutils.library.restartPython()
Then pull in the usual suspects:
import mlflow
from mlflow.deployments import get_deploy_client # optional for prod deploys
import pandas as pd
Each row holds the full conversation so far plus the grading guidelines for the next response:
eval_set = [
    {
        "request": {
            "messages": [
                {"role": "user", "content": "Hi"},
                {"role": "assistant", "content": "Hello! How can I help you today?"},
                {"role": "user", "content": "Tell me a joke"}
            ]
        },
        "guidelines": [
            "The response should be humorous but appropriate",
            "The response should be concise"
        ]
    },
    {
        "request": {
            "messages": [
                {"role": "user", "content": "What's the weather like?"},
                {"role": "assistant", "content": "I don't have real-time weather data. You'd need to check a weather service for that information."},
                {"role": "user", "content": "Can you explain how LLMs work?"}
            ]
        },
        "guidelines": [
            "The response should be technical but accessible",
            "The response should include a brief explanation of attention mechanisms"
        ]
    }
]
# Convert to a DataFrame so mlflow.evaluate can treat it as a table‑like object
eval_df = pd.DataFrame(eval_set)
I keep it in Pandas because it’s easy to version and quick to inspect.
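Before wiring it into an evaluation, I like to eyeball what is actually in each row. A minimal, purely illustrative sketch:
# Print the last user turn and the guidelines for every row in the eval set
for i, row in eval_df.iterrows():
    last_user = [m["content"] for m in row["request"]["messages"] if m["role"] == "user"][-1]
    print(f"Row {i}: last user turn={last_user!r}, guidelines={row['guidelines']}")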
Here’s a throw‑away rule‑based agent. The only thing that matters is the function signature—messages comes in as the full dialog.
@mlflow.trace(span_type="AGENT")
def my_agent(messages):
    """A trivial rule-based agent for illustration."""
    last_user_message = next((m["content"] for m in reversed(messages) if m["role"] == "user"), "")
    if "joke" in last_user_message.lower():
        return "Why did the AI go to art school? To learn how to draw conclusions!"
    elif "weather" in last_user_message.lower():
        return "I don't have access to real-time weather data, but I can help you understand weather patterns in general."
    elif any(term in last_user_message.lower() for term in ["llm", "language model"]):
        return (
            "Large Language Models (LLMs) are AI systems trained on vast amounts of text data. "
            "They use transformer architectures with attention mechanisms to model relationships between tokens."
        )
    else:
        return f"I understand you asked about: '{last_user_message}'. How can I help with that?"
mlflow.trace gives me latency and nested‑call traces for free—handy once I replace this with a real chain‑of‑thought or RAG pipeline.
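If you want to poke at those traces programmatically rather than in the Tracing UI, here’s a minimal sketch. It assumes MLflow 2.14+ tracing; the exact columns returned by mlflow.search_traces vary by version, so inspect them before relying on any one name:
# Invoke the agent once so a trace gets recorded
my_agent([{"role": "user", "content": "Tell me a joke"}])

# Pull recent traces into a pandas DataFrame (latency typically lives in execution_time_ms)
traces = mlflow.search_traces(max_results=5)
print(traces.columns.tolist())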
I add a second layer of checks that apply to every row:
global_guidelines = {
    "helpfulness": ["The response must be helpful and directly address the user's question"],
    "clarity": ["The response must be clear and well-structured"],
    "safety": ["The response must be safe and appropriate"]
}
Now let’s grade the agent. Everything lives inside an MLflow run so I can track it later:
with mlflow.start_run(run_name="agent_evaluation_v1") as run:
    evaluation_results = mlflow.evaluate(
        data=eval_df,
        model=lambda request: my_agent(**request),
        model_type="databricks-agent",
        evaluator_config={
            "databricks-agent": {
                "global_guidelines": global_guidelines
            }
        }
    )
Under the hood, Databricks calls its proprietary LLM judges and grades each response against every guideline, so the results include ratings for the helpfulness, clarity, and safety checks defined above as well as the per-row guidelines from the eval set.
print("Aggregated metrics:\n", evaluation_results.metrics)
per_request_results = evaluation_results.tables["eval_results"]
print("\nPer‑request results:\n", per_request_results)
Need a quick visual? In a notebook just run:
display(per_request_results)
Aggregates catch regressions; per‑request rows tell me exactly which turn broke the guideline.
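When a run regresses, I filter the per-request table down to the rows that failed. The column naming below is an assumption (judge verdict columns ending in /rating with yes/no values); confirm against per_request_results.columns in your workspace before using it:
# Hypothetical filter: keep rows where any LLM-judge rating column is not "yes"
rating_cols = [c for c in per_request_results.columns if c.endswith("/rating")]
failing = per_request_results[(per_request_results[rating_cols] != "yes").any(axis=1)]
display(failing)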
I push every run into Delta so I can chart trends and set alerts:
def append_metrics_to_table(run_name, mlflow_metrics, delta_table_name):
    data = {k: v for k, v in mlflow_metrics.items() if "error_count" not in k}
    data.update({"run_name": run_name, "timestamp": pd.Timestamp.now()})
    (spark.createDataFrame([data])
        .write.mode("append")
        .saveAsTable(delta_table_name))
# append_metrics_to_table("agent_evaluation_v1", evaluation_results.metrics, "catalog.schema.agent_eval_results")
Hook this into DLT or a Databricks SQL dashboard and you’ve got continuous monitoring.
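As a starting point for that dashboard, here’s a sketch of the trend query, assuming the table name from the commented-out call above:
# Pull every logged run in time order; switch the notebook visualization to a line chart
trend = spark.sql("""
    SELECT *
    FROM catalog.schema.agent_eval_results
    ORDER BY timestamp
""")
display(trend)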
You now have a repeatable, multi‑turn evaluation loop: every response is graded against per‑row and global guidelines, every run is tracked in MLflow, and every score lands in Delta where you can watch the trend.
Embed these checks early and your agents will improve with every release—no surprises when they hit real users. If you have questions or tweaks, ping me anytime. Happy shipping!