AlexMiller, Databricks Employee

The Gap in Agent Development

Databricks has made it straightforward to deploy AI agents—Model Serving endpoints, automatic MLflow tracing, Unity Catalog integration. But there's a gap between "deployed" and "production-ready":

You have:

  • Agent code running on a serving endpoint or locally tested
  • Traces accumulating in MLflow
  • Users hitting the endpoint

You don't have:

  • Systematic evaluation coverage
  • Custom scorers for your domain
  • A feedback loop from production back to development

MLflow 3's GenAI evaluation framework provides the primitives—scorers, datasets, evaluation runs—but assembling them manually doesn't scale. You need dozens of test cases, domain-specific scorers, and scripts that actually run against your Databricks environment.

This framework automates the assembly step. It treats production traces as first-class data, analyzes them to infer evaluation dimensions, generates runnable evaluation datasets and custom scorers, and outputs fully executable MLflow evaluation scripts. The result is not “auto-evaluation,” but a fast, reliable starting point—turning production behavior into a concrete evaluation loop that engineers can refine, extend, and operationalize.

Why Evaluation Matters for Production Agents

Building an agent that works in demos is straightforward. Building one that works reliably in production—handling edge cases, maintaining quality over time, and improving with each iteration—requires systematic evaluation.

The challenge for AI Engineers:

 

| Phase | Without Evaluation | With Evaluation |
|---|---|---|
| Development | "It seems to work" | Quantified quality baselines |
| Iteration | "I think this is better" | Measured improvement (or regression) |
| Monitoring | "Users are complaining" | Automated quality gates |

Where this framework bridges the gap: MLflow provides the evaluation primitives, and traces capture rich agent behavior—but assembling them into a runnable evaluation suite remains manual and time-consuming. This framework automates that assembly, converting traces into executable datasets, scorers, and scripts in minutes instead of days.

What Evaluation Requires for Production-Grade Agents

Effective evaluation is not a single metric or a one-time run—it is a system composed of multiple components that work together.

1. Evaluation Datasets

Evaluation datasets represent how users actually interact with the agent. These datasets can be built from a combination of:

  • Production traces, capturing real inputs, outputs, and execution context
  • Curated examples, targeting known edge cases or critical workflows
  • Synthetic cases, designed to probe unobserved or adversarial scenarios

Unlike traditional ML, agent evaluation datasets are often partially labeled. Rather than exact ground truth, they rely on expectations, guidelines, and constraints—making trace-derived data a particularly valuable foundation for realistic and scalable evaluation.

2. Scorers

Scorers judge agent responses against evaluation datasets. In practice, this usually involves a mix of:

  • Built-in checks (e.g., safety, relevance)
  • LLM-as-judge scorers for qualitative criteria
  • Programmatic scorers for structural or domain-specific rules
  • Human feedback to validate edge cases, calibrate scorers, and correct misaligned judgments

Scorers define what “good” means for an agent and are inherently application-specific.

3. Execution and Tracking

Evaluations must be runnable, repeatable, and tracked over time—logged with metrics and artifacts so regressions can be detected automatically. Without this, evaluation results remain anecdotal rather than actionable.

What MLflow 3 GenAI Evaluation Makes Possible

Before diving into the framework, it's worth understanding what MLflow 3 brings to the table.

Traces as first-class data

Every agent invocation on Databricks Model Serving captures a trace—inputs, outputs, latency, tool calls, retrieval results. These traces are queryable:

import mlflow

# Find slow traces from production
slow_traces = mlflow.search_traces(
    filter_string="attributes.execution_time_ms > 5000",
    experiment_ids=["prod-support-agent"]
)

# Find failures
errors = mlflow.search_traces(
    filter_string="attributes.status = 'ERROR'"
)

LLM-as-judge and Code-based scorers

MLflow 3 provides four levels of scorers, from zero-setup built-ins to fully custom code:

from mlflow.genai.scorers import Safety, RelevanceToQuery, Correctness

# Ready to use immediately
scorers = [Safety(), RelevanceToQuery(), Correctness()]

# Guidelines lets you define natural language criteria:
from mlflow.genai.scorers import Guidelines

policy_scorer = Guidelines(
    name="policy_compliance",
    guidelines=[
        "Response must reference official policy documents",
        "Response must not promise exceptions to stated policies"
    ]
)

# make_judge gives full control for complex evaluation:
from mlflow.genai.judges import make_judge

resolution_judge = make_judge(
    name="issue_resolution",
    instructions="""
    Evaluate if the customer's issue was resolved.
    User's messages: {{ inputs }}
    Agent's responses: {{ outputs }}

    Respond with exactly one of:
    - 'fully_resolved'
    - 'partially_resolved'
    - 'needs_follow_up'
    """
)

# Custom scorers use pure Python for programmatic checks:
from mlflow.genai.scorers import scorer

@scorer
def contains_citation(outputs: str) -> str:
    # Return pass/fail string
    return "yes" if "[source]" in outputs else "no"

Evaluation datasets with expectations

Evaluation datasets can include expectations rather than exact labels, enabling partial supervision:

eval_data = [
    {
        "inputs": {"query": "What's your refund policy?"},
        "expectations": {
            "expected_facts": ["30-day window", "original payment method"],
            "guidelines": ["Must cite policy document"]
        }
    }
]

This structure makes it possible to evaluate qualitative behavior without requiring exhaustive ground truth.
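Putting the pieces together, the dataset above can be scored directly with mlflow.genai.evaluate(); the predict_fn below is a hypothetical stand-in for an actual agent, included only to make the sketch runnable:

import mlflow
from mlflow.genai.scorers import Safety, Correctness

def predict_fn(query: str) -> dict:
    # Hypothetical agent call; replace with your serving endpoint or local agent
    return {"response": f"Refunds for '{query}' are available within a 30-day window."}

# Logs an evaluation run with per-scorer metrics so results are tracked over time
results = mlflow.genai.evaluate(
    data=eval_data,              # the expectations-based dataset defined above
    predict_fn=predict_fn,
    scorers=[
        Safety(),
        Correctness(),           # judges against expectations.expected_facts
    ],
)
print(results.metrics)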

The gap: MLflow 3 provides powerful primitives—but assembling them into comprehensive, domain-specific evaluation suites is still manual. Teams must analyze traces, design datasets, define scorers, and wire everything into runnable scripts. That overhead is exactly where autonomous generation becomes valuable.

The Reality: Evaluation Is Still Hard to Do Well

Despite having the right tools, evaluation remains one of the most under-invested parts of agent development.

The reason is familiar: evaluation is like unit tests or documentation. Everyone agrees it’s necessary. Few teams implement it thoroughly. Even fewer keep it up to date.

In practice, teams struggle with:

  • Extracting representative test cases from production behavior
  • Writing and maintaining domain-specific scorers
  • Wiring everything into runnable, repeatable evaluation pipelines

The effort is front-loaded, tedious, and easy to deprioritize—especially when agents appear to “work” in demos.

This is the gap the agent harness framework addresses by automating the most time-consuming parts of evaluation setup, while leaving judgment and refinement to engineers and subject-matter experts.

The Agent Harness Framework — From Traces to Runnable Evaluation

The framework generates three artifacts from your MLflow traces:

 

| Artifact | What it contains | How it's used |
|---|---|---|
| Evaluation Dataset | Test cases derived from production traces + synthetic edge cases | Input to mlflow.genai.evaluate() |
| Custom Scorers | Domain-specific LLM judges + programmatic checks | Passed to evaluation as scorer list |
| Run Script | Complete Python script targeting your Databricks workspace | Execute with python run_eval.py |

 

Key constraint: Every generated artifact is validated before output. Scorers compile. Datasets match the expected schema. Scripts run without errors. No "recommendations"—only working code.
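What that validation involves is, in spirit, something like the sketch below. This is illustrative only, not the framework's actual validation code:

import importlib.util

def scorer_module_compiles(path: str) -> bool:
    # Import the generated scorers.py; a syntax or import error fails validation
    spec = importlib.util.spec_from_file_location("generated_scorers", path)
    module = importlib.util.module_from_spec(spec)
    try:
        spec.loader.exec_module(module)
    except Exception:
        return False
    return callable(getattr(module, "get_all_scorers", None))

def record_matches_schema(record: dict) -> bool:
    # mlflow.genai.evaluate() expects a nested "inputs" dict, with optional
    # "outputs" and "expectations" dicts alongside it
    if not isinstance(record.get("inputs"), dict):
        return False
    return all(isinstance(record.get(key, {}), dict) for key in ("outputs", "expectations"))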

Dataset Strategy

The framework supports three strategies based on what you have:

 

| Strategy | When to Use | Predict function (predict_fn) needed? |
|---|---|---|
| From Traces | Have existing traces, no agent code access | No — outputs pre-computed |
| Manual | Need curated edge cases, have agent callable | Yes — calls agent at eval time |
| Hybrid | Best coverage | Both approaches combined |

Most deployments start with From Traces, evaluating responses the agent has already produced in development or production.
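A minimal sketch of the From Traces strategy, assuming your MLflow version accepts the search_traces result directly as evaluation data (otherwise, extract inputs and outputs into records, as in the generated dataset shown later in this post); the experiment ID is a placeholder:

import mlflow
from mlflow.genai.scorers import Safety, RelevanceToQuery

# Pull recent successful traces from the agent's experiment
traces = mlflow.search_traces(
    experiment_ids=["<experiment-id>"],
    filter_string="attributes.status = 'OK'",
    max_results=50,
)

# Inputs AND outputs already exist on the traces, so no predict_fn is needed:
# the scorers judge the responses the agent has already produced.
results = mlflow.genai.evaluate(
    data=traces,
    scorers=[Safety(), RelevanceToQuery()],
)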

A Starting Point, Not a Replacement

This framework doesn't fully automate evaluation—and that's intentional. Evaluation requires human judgment about what matters for your specific use case.

What it provides:

  • A trace-derived baseline grounded in real production behavior
  • Automatically identified patterns and failure modes
  • Runnable evaluation code from day one

What remains human-driven:

  • Refining test cases and scorer criteria
  • Adding unobserved edge cases
  • Iterating based on evaluation results

The outcome is a continuous loop: traces → generated evaluation → review → improved evaluation → agent changes → new traces. Each cycle increases coverage and reliability.

Agent Architecture — The Agent Harness Pattern

Long-running generation tasks fail in predictable ways: agents accumulate context, attempt to solve too much in a single pass, or terminate before work is complete.

This framework follows Anthropic's research on effective harnesses for long-running agents by separating initialization from incremental execution, using short-lived sessions with fresh context and shared, persistent state.

An agent harness is the orchestration layer around the model that manages what the LLM cannot: task decomposition, session boundaries, state handoff, and failure recovery. Instead of relying on a single long-running agent, the harness executes work in bounded sessions with check-pointed state, enabling reliable progress and safe retries.

 

Built on Claude Agent SDK

The framework is built using the Claude Agent SDK, Anthropic's framework for building autonomous agents that can use tools, maintain state across sessions, and execute multi-step tasks. Claude Agent SDK treats the file system as the source of truth. Tasks are executed by inspecting existing files, modifying them incrementally, validating outputs, and persisting state explicitly—mirroring how a human engineer would work.

Why Claude Agent SDK?

| Capability | How it's used |
|---|---|
| Tool Integration | MCP tools for MLflow trace queries, experiment metadata, annotations, and metrics |
| File System Access | Read/write evaluation artifacts; use tools like ls, grep, diff, and structured edits to iteratively build and validate code |
| Session Management | Fresh model context per session with file-based state handoff |
| Skills System | Load verified MLflow 3.x API patterns and known gotchas before code generation |
| Streaming Execution | Real-time progress visibility during long-running generation and validation |

Databricks Integration: All inference runs through Claude Opus 4.5 via the Databricks Foundation Model API, keeping execution and data within the workspace boundary. Skills files, prompts, data, and intermediate state are stored in Unity Catalog Volumes, making the system auditable, reproducible, and fully governed by Databricks permissions.

This framework focuses on generating trace-driven evaluation infrastructure, not improving agent behavior directly, and is complementary to tools like Agent Bricks Learning from Human Feedback and DSPy.

Architecture Overview

[Figure: architecture overview]

 

Agent Flow

[Figure: simplified agent flow]

 

Why Sessions, Not Agents?

Each session runs with fresh model context while reading and writing shared, file-backed state. This avoids the common failure mode of long-running agents—accumulated errors, drifting context, and brittle retries—while preserving continuity through explicit state handoff.

The initializer session analyzes traces and produces a concrete task plan. Each worker session then executes a single task, persists results, and exits. If validation fails, a “fix task” is added and picked up by a subsequent session. Progress is incremental, explicit, and restartable by design.
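Conceptually, that loop looks like the sketch below. The file layout and the "status" task field are simplified and hypothetical, and the real harness drives each worker through the Claude Agent SDK rather than a plain function call:

import json
from pathlib import Path

STATE_DIR = Path("sessions") / "2025-12-19_093956"  # shared, file-backed state

def next_pending_task() -> dict | None:
    # eval_tasks.json is produced by the initializer; "status" is a hypothetical field
    tasks = json.loads((STATE_DIR / "eval_tasks.json").read_text())
    return next((t for t in tasks if t.get("status") == "pending"), None)

def run_worker_session(task: dict) -> None:
    # Fresh model context: the worker reads the task plan and persisted files,
    # executes exactly one task, writes its outputs, and exits
    ...

while (task := next_pending_task()) is not None:
    run_worker_session(task)
    # On validation failure the worker appends a "fix task" to eval_tasks.json,
    # which a later session picks up; progress is restartable from disk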

Why Databricks Jobs?

This session-based architecture aligns naturally with Databricks Jobs, not interactive notebooks:

 

| Aspect | Interactive Notebook | Databricks Job |
|---|---|---|
| Context | State accumulates across cells | Fresh environment per run |
| Execution | Requires manual oversight | Runs unattended |
| Failure Recovery | Manual restart | Automatic retry from persisted state |
| Scheduling | Ad hoc | Scheduled or event-triggered |
| Monitoring | Console output | Job UI, alerts, MLflow traces |

Because each session is stateless beyond what’s written to UC Volumes, retries are safe and deterministic. A failed task can be re-run without replaying the entire workflow.

Typical workflow:

  1. Schedule a Databricks Job to run nightly or weekly against an MLflow experiment (a minimal scheduling sketch follows this list)
  2. The initializer analyzes new traces and updates evaluation coverage
  3. Worker sessions generate and validate updated artifacts
  4. Engineers review outputs in a Unity Catalog Volume and refine scorers as needed
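A minimal scheduling sketch using the Databricks Python SDK; the entry-point path, cluster ID, and cron expression are placeholders to adapt to your workspace:

from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

job = w.jobs.create(
    name="mlflow-eval-agent-nightly",
    tasks=[
        jobs.Task(
            task_key="generate_and_validate_eval",
            spark_python_task=jobs.SparkPythonTask(
                python_file="/Workspace/path/to/mlflow-eval-agent/entrypoint.py"  # placeholder
            ),
            existing_cluster_id="<cluster-id>",  # placeholder
        )
    ],
    schedule=jobs.CronSchedule(
        quartz_cron_expression="0 0 2 * * ?",  # nightly at 2 AM
        timezone_id="UTC",
    ),
)
print(f"Created job {job.job_id}")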

Trace Analysis: The Starting Point

The initializer session does not invent test cases—it derives them directly from observed behavior captured in MLflow traces.

Below is a representative excerpt from a real multi-agent deployment:

{
  "agent_type": "Multi-Genie Orchestrator - A LangGraph-based supervisor
                 that routes business questions to specialized Databricks
                 Genie agents",

  "trace_summary": {
    "total_analyzed": 30,
    "success_count": 25,
    "error_count": 5,
    "avg_latency_ms": 27000
  },

  "query_types_observed": [
    {"type": "sales_pipeline", "routing": "customer_sales"},
    {"type": "supply_chain", "routing": "supply_chain"},
    {"type": "system_info", "routing": "none (supervisor handles directly)"}
  ],

  "optimization_opportunities": [
    {
      "issue": "Suboptimal routing",
      "description": "Supervisor sometimes calls both genies when only one is needed",
      "impact": "5+ seconds wasted per unnecessary genie call"
    }
  ]
}

Because inefficient routing is visible in production traces, the framework generates a dedicated efficient_routing scorer to measure and prevent this behavior.

Initializer Session Output (Summary)

After analyzing traces, the initializer produces a concrete evaluation plan:

Agent Understanding:

  • Type: Multi-Agent Supervisor (LangGraph-based)
  • Purpose: Orchestrates specialized "Genie" agents to answer business intelligence queries
  • Domains: Sales pipeline (customer_sales) and Supply chain (supply_chain)
  • Model: Claude 3.7 Sonnet via Databricks
  • I/O: Takes {query: "..."}, returns {response: "..."} with markdown tables

Evaluation Dimensions Identified:

| Scorer | Type | Rationale |
|---|---|---|
| Safety | Builtin | Required baseline: no harmful content |
| RelevanceToQuery | Builtin | Responses must address business questions |
| correct_routing | Guidelines | Route to appropriate Genie based on query type |
| efficient_routing | Guidelines | Don't call unnecessary Genies |
| data_presentation | Guidelines | Tables must be clear and well-formatted |
| error_handling | Custom | Validate error response structure |

Dataset Strategy: traces (no predict_fn needed)

  • Extract inputs AND outputs from existing production traces
  • Evaluate pre-computed responses directly

Task Plan Created (eval_tasks.json)

  1. Build evaluation dataset - Extract from 5 sample traces
  2. Create scorers - 6 scorers (2 builtin, 3 guidelines, 1 custom)
  3. Generate eval script - mlflow.genai.evaluate() with pre-computed outputs
  4. Run and validate - Execute and verify metrics logged

Output Files:

  • eval_tasks.json - Task list for worker sessions
  • state/analysis.json - Trace analysis with 20 OK, 5 ERROR traces

 

Generated Evaluation Dataset 

# From sessions/2025-12-19_093956/evaluation/eval_dataset.py
EVAL_DATA = [
    # Sales Pipeline Query - should route to customer_sales only
    # Status: OK but SUBOPTIMAL - called both genies unnecessarily
    {
        "inputs": {
            "query": "How's my pipeline just in the americas and by segment?"
        },
        "outputs": {
            "response": """Sales Pipeline for the Americas by Segment:

|    | company_size_segment__c   |   sum(pipeline_amount) |
|---:|:--------------------------|-----------------------:|
|  0 | ENT                       |                15999.6 |
|  1 | MM                        |                18003.1 |
|  2 | SMB                       |                26001.2 |"""
        },
        "metadata": {
            "source_trace": "tr-d45af7104661af36e7ccd1e81a76f2d8",
            "query_type": "sales_pipeline",
            "expected_routing": "customer_sales",
            "actual_routing": "customer_sales, supply_chain",
            "routing_optimal": False
        }
    },
    # ... more test cases
]

Generated Scorers

Scorers tailored to your agent's domain, derived from trace analysis:

# From sessions/2025-12-19_093956/evaluation/scorers.py
from mlflow.genai.scorers import Guidelines, Safety, RelevanceToQuery

# Built-in scorers
safety_scorer = Safety()
relevance_scorer = RelevanceToQuery()

# Domain-specific: Correct routing for multi-agent orchestration
correct_routing_scorer = Guidelines(
    name="correct_routing",
    guidelines="""The agent should correctly route questions:
    1. Sales questions → customer_sales Genie
    2. Supply chain questions → supply_chain Genie
    3. System questions → Answer directly (no Genie call)
    4. Out-of-scope → Decline gracefully"""
)

# Performance: Identified from trace analysis showing suboptimal routing
efficient_routing_scorer = Guidelines(
    name="efficient_routing",
    guidelines="""The agent should minimize unnecessary Genie calls:
    1. Pure sales questions: Only customer_sales Genie
    2. Pure supply chain questions: Only supply_chain Genie
    3. Only multi-domain questions should call both Genies

    Unnecessary calls waste 5+ seconds each."""
)

def get_all_scorers():
    return [
        safety_scorer,
        relevance_scorer,
        correct_routing_scorer,
        efficient_routing_scorer,
        # ... additional scorers
    ]
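The third generated artifact, the run script, wires the dataset and scorers together. A simplified sketch of what that script looks like (the experiment path is a placeholder, and imports assume the script runs from the generated evaluation directory):

# Simplified sketch of the generated run script
import mlflow
from eval_dataset import EVAL_DATA
from scorers import get_all_scorers

mlflow.set_experiment("/Users/<you>/agent-evaluation")  # placeholder experiment

# Outputs are pre-computed in EVAL_DATA, so no predict_fn is passed:
# scorers judge the responses the agent already produced in production
results = mlflow.genai.evaluate(
    data=EVAL_DATA,
    scorers=get_all_scorers(),
)
print(results.metrics)  # e.g. Safety/mean, correct_routing/mean, ...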

Evaluation Results

After six sessions (one initializer, five workers), the evaluation suite executes successfully. The framework validates everything before declaring success:

{
  "validation_status": "passed",
  "metrics": {
    "Safety/mean": 1.0,
    "RelevanceToQuery/mean": 0.7,
    "correct_routing/mean": 0.5,
    "efficient_routing/mean": 0.9
  },
  "validation_checks": {
    "script_executes": true,
    "all_scorers_returned_values": true,
    "no_nan_scores": true
  }
}

Evaluation results in MLflow UI:

[Screenshot: evaluation results in the MLflow UI]

Providing the Agent with Skills: Verified API Patterns

Generated code fails most often when it targets outdated or incorrect APIs. MLflow’s GenAI evaluation interfaces have evolved rapidly, and examples from older blog posts or Q&A sites are frequently wrong.

To prevent this, the framework uses Anthropic’s Skills mechanism—a way to provide agents with verified, task-specific knowledge before they write any code (see Anthropic’s announcement: “Equipping agents for the real world with Agent Skills”). Skills act as a constrained reference layer, ensuring the agent follows known-correct patterns rather than relying on training-time assumptions.

The framework loads Skills files containing verified MLflow 3.1+ interfaces before generating any evaluation artifacts (see the GitHub repository for more details on the Skills files).

Common Mistakes Prevented

From .claude/skills/mlflow-evaluation/references/GOTCHAS.md:

 

| Common Mistake | Correct Pattern |
|---|---|
| mlflow.evaluate() | mlflow.genai.evaluate() |
| {"query": "..."} (flat) | {"inputs": {"query": "..."}} (nested) |
| def predict_fn(inputs): | def predict_fn(**inputs): (unpacked kwargs) |
| Guidelines(guidelines="...") | Guidelines(name="...", guidelines="...") |
| from mlflow.metrics | from mlflow.genai.scorers |

Verified Interfaces

From CRITICAL-interfaces.md:

# Scorer function signature (verified for MLflow 3.1+)
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback

@scorer
def my_scorer(
    inputs: dict,       # What was sent to the agent
    outputs: dict,      # What the agent returned
    expectations: dict  # Ground truth (optional)
) -> Feedback | bool | int | float:
    return Feedback(value=True, rationale="Explanation")

By loading these Skills up front, the agent generates evaluation code that matches the current MLflow API, not outdated patterns from its training data.

Why This Matters

Without verified interfaces, generated code often fails on first run:

  • Wrong import paths (MLflow 2.x vs 3.x)
  • Missing required parameters (Guidelines needs both name and guidelines)
  • Wrong function signatures (predict_fn receives unpacked kwargs)

The Skills system catches these at generation time, not runtime.

Getting Started on Databricks

This framework is designed to run locally, interactively in a notebook, or as a Databricks Job. Full setup instructions, configuration details, and examples are documented in the GitHub repository:

GitHub: https://github.com/alexmillerdb/mlflow-eval-agent

Prerequisites (Summary)

  • Databricks workspace
  • MLflow 3.1+
  • An agent with traces logged to an MLflow experiment

How It Runs (At a Glance)

  1. The initializer session analyzes production traces and creates a task plan
  2. Worker sessions generate the evaluation dataset, scorers, and runnable script
  3. A validation step executes the evaluation and verifies results
  4. Outputs are written to sessions/{timestamp}/evaluation/

The generated evaluation can be run directly against your Databricks workspace, with results logged back to MLflow for tracking and comparison.
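Because every evaluation run is logged to MLflow, comparing suites over time is a short query away; the experiment name and metric columns below are illustrative:

import mlflow

# Pull metrics from past evaluation runs to spot regressions across iterations
runs = mlflow.search_runs(experiment_names=["/Users/<you>/agent-evaluation"])
print(runs[["start_time", "metrics.correct_routing/mean", "metrics.efficient_routing/mean"]])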

For step-by-step instructions—local development, Databricks deployment, and job configuration—refer to the repository README.

Summary

| Aspect | Manual Approach | This Framework |
|---|---|---|
| Time to first evaluation suite | 1-3 days | < 2 hours |
| Test cases derived from production | Rarely done | Standard |
| Scorers that compile on first run | ~60% | >95% |
| Coverage of actual failure modes | Ad-hoc | Systematic |

The value isn’t just speed—it’s a repeatable signal. By removing the front-loaded, manual work that makes evaluation hard to sustain, MLflow traces can reliably feed evaluation, evaluation can surface real gaps and regressions, and engineers can focus on deciding what to fix or improve next.

By turning evaluation into trace-driven infrastructure rather than an ad-hoc task, this framework makes systematic measurement of agent behavior practical in production, using MLflow and Databricks primitives teams already rely on.