Databricks has made it straightforward to deploy AI agents—Model Serving endpoints, automatic MLflow tracing, Unity Catalog integration. But there's a gap between "deployed" and "production-ready":
You have: a deployed endpoint, traces flowing automatically into MLflow, and governance through Unity Catalog.
You don't have: an evaluation dataset that reflects real usage, scorers that encode what "good" means for your domain, or a repeatable way to tell whether a change improved the agent or quietly degraded it.
MLflow 3's GenAI evaluation framework provides the primitives—scorers, datasets, evaluation runs—but assembling them manually doesn't scale. You need dozens of test cases, domain-specific scorers, and scripts that actually run against your Databricks environment.
This framework automates the assembly step. It treats production traces as first-class data, analyzes them to infer evaluation dimensions, generates runnable evaluation datasets and custom scorers, and outputs fully executable MLflow evaluation scripts. The result is not “auto-evaluation,” but a fast, reliable starting point—turning production behavior into a concrete evaluation loop that engineers can refine, extend, and operationalize.
Building an agent that works in demos is straightforward. Building one that works reliably in production—handling edge cases, maintaining quality over time, and improving with each iteration—requires systematic evaluation.
The challenge for AI Engineers:
| Phase | Without Evaluation | With Evaluation |
|---|---|---|
| Development | "It seems to work" | Quantified quality baselines |
| Iteration | "I think this is better" | Measured improvement (or regression) |
| Monitoring | "Users are complaining" | Automated quality gates |
Where this framework bridges the gap: MLflow provides the evaluation primitives, and traces capture rich agent behavior—but assembling them into a runnable evaluation suite remains manual and time-consuming. This framework automates that assembly, converting traces into executable datasets, scorers, and scripts in minutes instead of days.
Effective evaluation is not a single metric or a one-time run—it is a system composed of multiple components that work together.
Evaluation datasets represent how users actually interact with the agent. These datasets can be built from a combination of production traces, manually curated edge cases, and synthetically generated scenarios.
Unlike traditional ML, agent evaluation datasets are often partially labeled. Rather than exact ground truth, they rely on expectations, guidelines, and constraints—making trace-derived data a particularly valuable foundation for realistic and scalable evaluation.
Scorers judge agent responses against evaluation datasets. In practice, this usually involves a mix of built-in checks, LLM judges guided by natural language criteria, and programmatic code-based assertions.
Scorers define what “good” means for an agent and are inherently application-specific.
Evaluations must be runnable, repeatable, and tracked over time—logged with metrics and artifacts so regressions can be detected automatically. Without this, evaluation results remain anecdotal rather than actionable.
Before diving into the framework, it's worth understanding what MLflow 3 brings to the table.
Every agent invocation on Databricks Model Serving captures a trace—inputs, outputs, latency, tool calls, retrieval results. These traces are queryable:
```python
import mlflow

# Find slow traces from production
slow_traces = mlflow.search_traces(
    filter_string="attributes.execution_time_ms > 5000",
    experiment_ids=["prod-support-agent"]
)

# Find failures
errors = mlflow.search_traces(
    filter_string="attributes.status = 'ERROR'"
)
```
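Building on the query above, each trace row can be turned into a candidate evaluation record. This is a sketch only: the DataFrame column names used here ("request", "response", "trace_id") are assumptions that vary slightly across MLflow versions, so inspect slow_traces.columns before relying on them.

```python
import json

# Sketch: convert queried traces into candidate evaluation records.
# Column names are assumptions - confirm against slow_traces.columns.
candidate_cases = []
for _, row in slow_traces.iterrows():
    request = row["request"]
    response = row["response"]
    candidate_cases.append({
        "inputs": json.loads(request) if isinstance(request, str) else request,
        "outputs": json.loads(response) if isinstance(response, str) else response,
        "metadata": {"source_trace": row["trace_id"]},
    })
```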
MLflow 3 provides four levels of LLM-based evaluation, from zero-setup to fully custom:
```python
from mlflow.genai.scorers import Safety, RelevanceToQuery, Correctness

# Ready to use immediately
scorers = [Safety(), RelevanceToQuery(), Correctness()]

# Guidelines lets you define natural language criteria:
from mlflow.genai.scorers import Guidelines

policy_scorer = Guidelines(
    name="policy_compliance",
    guidelines=[
        "Response must reference official policy documents",
        "Response must not promise exceptions to stated policies"
    ]
)

# make_judge gives full control for complex evaluation:
from mlflow.genai.judges import make_judge

resolution_judge = make_judge(
    name="issue_resolution",
    instructions="""
Evaluate if the customer's issue was resolved.

User's messages: {{ inputs }}
Agent's responses: {{ outputs }}

Respond with exactly one of:
- 'fully_resolved'
- 'partially_resolved'
- 'needs_follow_up'
"""
)

# Custom scorers use pure Python for programmatic checks:
from mlflow.genai.scorers import scorer

@scorer
def contains_citation(outputs: str) -> str:
    # Return pass/fail string
    return "yes" if "[source]" in outputs else "no"
```
Evaluation datasets can include expectations rather than exact labels, enabling partial supervision:
```python
eval_data = [
    {
        "inputs": {"query": "What's your refund policy?"},
        "expectations": {
            "expected_facts": ["30-day window", "original payment method"],
            "guidelines": ["Must cite policy document"]
        }
    }
]
```
This structure makes it possible to evaluate qualitative behavior without requiring exhaustive ground truth.
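As a minimal sketch of how such a dataset plugs into an evaluation run (the predict_fn below is a placeholder for a call to your own agent, not part of MLflow):

```python
import mlflow
from mlflow.genai.scorers import Correctness, Guidelines

# Placeholder predict_fn - replace the body with a call to your deployed agent.
# MLflow unpacks each record's "inputs" dict as keyword arguments.
def predict_fn(query: str) -> dict:
    return {"response": f"(agent answer for: {query})"}

results = mlflow.genai.evaluate(
    data=eval_data,                 # the expectation-based records above
    predict_fn=predict_fn,
    scorers=[
        Correctness(),              # judges against "expected_facts" in expectations
        Guidelines(name="cites_policy", guidelines="Must cite the policy document"),
    ],
)
print(results.metrics)              # aggregate scores, also logged to MLflow
```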
The gap: MLflow 3 provides powerful primitives—but assembling them into comprehensive, domain-specific evaluation suites is still manual. Teams must analyze traces, design datasets, define scorers, and wire everything into runnable scripts. That overhead is exactly where autonomous generation becomes valuable.
Despite having the right tools, evaluation remains one of the most under-invested parts of agent development.
The reason is familiar: evaluation is like unit tests or documentation. Everyone agrees it’s necessary. Few teams implement it thoroughly. Even fewer keep it up to date.
In practice, teams struggle with analyzing traces for failure patterns, building representative datasets, defining scorers that capture domain-specific quality, and wiring everything into scripts that run reliably.
The effort is front-loaded, tedious, and easy to deprioritize—especially when agents appear to “work” in demos.
This is the gap the agent harness framework addresses by automating the most time-consuming parts of evaluation setup, while leaving judgment and refinement to engineers and subject-matter experts.
The framework generates three artifacts from your MLflow traces:
| Artifact | What it contains | How it's used |
|---|---|---|
| Evaluation Dataset | Test cases derived from production traces + synthetic edge cases | Input to mlflow.genai.evaluate() |
| Custom Scorers | Domain-specific LLM judges + programmatic checks | Passed to evaluation as scorer list |
| Run Script | Complete Python script targeting your Databricks workspace | Execute with python run_eval.py |
Key constraint: Every generated artifact is validated before output. Scorers compile. Datasets match the expected schema. Scripts run without errors. No "recommendations"—only working code.
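As an illustration of what "matching the expected schema" means in practice (the rules below are a simplified stand-in, not the framework's actual validation code):

```python
# Simplified stand-in for the framework's dataset validation step.
def validate_eval_dataset(records: list[dict]) -> list[str]:
    problems = []
    for i, rec in enumerate(records):
        if not isinstance(rec.get("inputs"), dict):
            problems.append(f"record {i}: 'inputs' must be a dict")
        if "outputs" not in rec and "expectations" not in rec:
            problems.append(f"record {i}: needs 'outputs' or 'expectations'")
    return problems

sample = [{"inputs": {"query": "example"}, "outputs": {"response": "example answer"}}]
assert validate_eval_dataset(sample) == []
```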
The framework supports three strategies based on what you have:
| Strategy | When to Use | Predict function (predict_fn) needed? |
|---|---|---|
| From Traces | Have existing traces, no agent code access | No — outputs pre-computed |
| Manual | Need curated edge cases, have agent callable | Yes — calls agent at eval time |
| Hybrid | Best coverage | Both approaches combined |
Most deployments start with From Traces, evaluating responses the agent has already produced in development or production.
This framework doesn't fully automate evaluation—and that's intentional. Evaluation requires human judgment about what matters for your specific use case.
What it provides: evaluation datasets derived from real traces, scorers aligned with observed behavior, and run scripts validated against your workspace.
What remains human-driven: deciding what "good" means for your use case, reviewing and refining the generated scorers and test cases, and acting on what the results reveal.
The outcome is a continuous loop: traces → generated evaluation → review → improved evaluation → agent changes → new traces. Each cycle increases coverage and reliability.
Long-running generation tasks fail in predictable ways: agents accumulate context, attempt to solve too much in a single pass, or terminate before work is complete.
This framework follows Anthropic's research on effective harnesses for long-running agents by separating initialization from incremental execution, using short-lived sessions with fresh context and shared, persistent state.
An agent harness is the orchestration layer around the model that manages what the LLM cannot: task decomposition, session boundaries, state handoff, and failure recovery. Instead of relying on a single long-running agent, the harness executes work in bounded sessions with check-pointed state, enabling reliable progress and safe retries.
The framework is built using the Claude Agent SDK, Anthropic's framework for building autonomous agents that can use tools, maintain state across sessions, and execute multi-step tasks. Claude Agent SDK treats the file system as the source of truth. Tasks are executed by inspecting existing files, modifying them incrementally, validating outputs, and persisting state explicitly—mirroring how a human engineer would work.
Why Claude Agent SDK?
| Capability | How it's used |
|---|---|
| Tool Integration | MCP tools for MLflow trace queries, experiment metadata, annotations, and metrics |
| File System Access | Read/write evaluation artifacts; use tools like ls, grep, diff, and structured edits to iteratively build and validate code |
| Session Management | Fresh model context per session with file-based state handoff |
| Skills System | Load verified MLflow 3.x API patterns and known gotchas before code generation |
| Streaming Execution | Real-time progress visibility during long-running generation and validation |
Databricks Integration: All inference runs through Claude Opus 4.5 via the Databricks Foundation Model API, keeping execution and data within the workspace boundary. Skills files, prompts, data, and intermediate state are stored in Unity Catalog Volumes, making the system auditable, reproducible, and fully governed by Databricks permissions.
This framework focuses on generating trace-driven evaluation infrastructure, not improving agent behavior directly, and is complementary to tools like Agent Bricks Learning from Human Feedback and DSPy.
Each session runs with fresh model context while reading and writing shared, file-backed state. This avoids the common failure mode of long-running agents—accumulated errors, drifting context, and brittle retries—while preserving continuity through explicit state handoff.
The initializer session analyzes traces and produces a concrete task plan. Each worker session then executes a single task, persists results, and exits. If validation fails, a “fix task” is added and picked up by a subsequent session. Progress is incremental, explicit, and restartable by design.
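The loop below sketches that pattern. The file path and the run_worker_session helper are hypothetical, shown only to illustrate one-task-per-session execution with file-backed state:

```python
import json
from pathlib import Path

# Hypothetical state file on a UC Volume; the real framework's layout differs.
TASKS = Path("/Volumes/main/eval_agent/state/eval_tasks.json")

def run_worker_session(task: dict) -> dict:
    """Placeholder for one bounded agent session with fresh context."""
    return {"task_id": task["id"], "status": "done"}

while True:
    tasks = json.loads(TASKS.read_text())
    pending = [t for t in tasks if t["status"] == "pending"]
    if not pending:
        break
    task = pending[0]
    result = run_worker_session(task)          # fresh context, one task only
    task["status"] = result["status"]          # persist progress before exiting
    TASKS.write_text(json.dumps(tasks, indent=2))
```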
This session-based architecture aligns naturally with Databricks Jobs, not interactive notebooks:
| Aspect | Interactive Notebook | Databricks Job |
|---|---|---|
| Context | State accumulates across cells | Fresh environment per run |
| Execution | Requires manual oversight | Runs unattended |
| Failure Recovery | Manual restart | Automatic retry from persisted state |
| Scheduling | Ad hoc | Scheduled or event-triggered |
| Monitoring | Console output | Job UI, alerts, MLflow traces |
Because each session is stateless beyond what’s written to UC Volumes, retries are safe and deterministic. A failed task can be re-run without replaying the entire workflow.
Typical workflow: an initializer session analyzes traces and writes a task plan; worker sessions then generate and validate the dataset, scorers, and run script one task at a time; failed validations become fix tasks; a final session executes the full evaluation and records the results.
The initializer session does not invent test cases—it derives them directly from observed behavior captured in MLflow traces.
Below is a representative excerpt from a real multi-agent deployment:
```json
{
  "agent_type": "Multi-Genie Orchestrator - A LangGraph-based supervisor that routes business questions to specialized Databricks Genie agents",
  "trace_summary": {
    "total_analyzed": 30,
    "success_count": 25,
    "error_count": 5,
    "avg_latency_ms": 27000
  },
  "query_types_observed": [
    {"type": "sales_pipeline", "routing": "customer_sales"},
    {"type": "supply_chain", "routing": "supply_chain"},
    {"type": "system_info", "routing": "none (supervisor handles directly)"}
  ],
  "optimization_opportunities": [
    {
      "issue": "Suboptimal routing",
      "description": "Supervisor sometimes calls both genies when only one is needed",
      "impact": "5+ seconds wasted per unnecessary genie call"
    }
  ]
}
```
Because inefficient routing is visible in production traces, the framework generates a dedicated efficient_routing scorer to measure and prevent this behavior.
After analyzing traces, the initializer produces a concrete evaluation plan:
Agent Understanding: a LangGraph-based supervisor that routes business questions to specialized Databricks Genie agents (customer_sales and supply_chain) and answers system questions directly.
Evaluation Dimensions Identified:
| Scorer | Type | Rationale |
|---|---|---|
| Safety | Builtin | Required baseline - no harmful content |
| RelevanceToQuery | Builtin | Responses must address business questions |
| correct_routing | Guidelines | Route to appropriate Genie based on query type |
| efficient_routing | Guidelines | Don't call unnecessary Genies |
| data_presentation | Guidelines | Tables must be clear and well-formatted |
| error_handling | Custom | Validate error response structure |
Dataset Strategy: traces (no predict_fn needed)
Task Plan Created (eval_tasks.json)
Output Files: eval_dataset.py, scorers.py, and run_eval.py, written to the session's evaluation directory.
```python
# From sessions/2025-12-19_093956/evaluation/eval_dataset.py

EVAL_DATA = [
    # Sales Pipeline Query - should route to customer_sales only
    # Status: OK but SUBOPTIMAL - called both genies unnecessarily
    {
        "inputs": {
            "query": "How's my pipeline just in the americas and by segment?"
        },
        "outputs": {
            "response": """Sales Pipeline for the Americas by Segment:

|    | company_size_segment__c | sum(pipeline_amount) |
|---:|:------------------------|---------------------:|
|  0 | ENT                     |              15999.6 |
|  1 | MM                      |              18003.1 |
|  2 | SMB                     |              26001.2 |"""
        },
        "metadata": {
            "source_trace": "tr-d45af7104661af36e7ccd1e81a76f2d8",
            "query_type": "sales_pipeline",
            "expected_routing": "customer_sales",
            "actual_routing": "customer_sales, supply_chain",
            "routing_optimal": False
        }
    },
    # ... more test cases
]
```
Scorers tailored to your agent's domain, derived from trace analysis:
```python
# From sessions/2025-12-19_093956/evaluation/scorers.py

from mlflow.genai.scorers import Guidelines, Safety, RelevanceToQuery

# Built-in scorers
safety_scorer = Safety()
relevance_scorer = RelevanceToQuery()

# Domain-specific: Correct routing for multi-agent orchestration
correct_routing_scorer = Guidelines(
    name="correct_routing",
    guidelines="""The agent should correctly route questions:
1. Sales questions → customer_sales Genie
2. Supply chain questions → supply_chain Genie
3. System questions → Answer directly (no Genie call)
4. Out-of-scope → Decline gracefully"""
)

# Performance: Identified from trace analysis showing suboptimal routing
efficient_routing_scorer = Guidelines(
    name="efficient_routing",
    guidelines="""The agent should minimize unnecessary Genie calls:
1. Pure sales questions: Only customer_sales Genie
2. Pure supply chain questions: Only supply_chain Genie
3. Only multi-domain questions should call both Genies
Unnecessary calls waste 5+ seconds each."""
)

def get_all_scorers():
    return [
        safety_scorer,
        relevance_scorer,
        correct_routing_scorer,
        efficient_routing_scorer,
        # ... additional scorers
    ]
```
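A generated run script wires these two files into an evaluation call. The sketch below approximates its shape (the experiment path is an assumption, and the real generated script includes additional setup and validation):

```python
# Approximate shape of a generated run_eval.py - not the framework's verbatim output.
import mlflow
from eval_dataset import EVAL_DATA
from scorers import get_all_scorers

mlflow.set_experiment("/Shared/multi-genie-orchestrator-eval")  # assumed experiment path

# "From Traces" strategy: outputs are pre-computed, so no predict_fn is passed.
results = mlflow.genai.evaluate(
    data=EVAL_DATA,
    scorers=get_all_scorers(),
)
print(results.metrics)
```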
After six sessions (one initializer, five workers), the evaluation suite executes successfully. The framework validates everything before declaring success:
```json
{
  "validation_status": "passed",
  "metrics": {
    "Safety/mean": 1.0,
    "RelevanceToQuery/mean": 0.7,
    "correct_routing/mean": 0.5,
    "efficient_routing/mean": 0.9
  },
  "validation_checks": {
    "script_executes": true,
    "all_scorers_returned_values": true,
    "no_nan_scores": true
  }
}
```
Evaluation results in MLflow UI:
Providing the Agent with Skills - Verified API Patterns
Generated code fails most often when it targets outdated or incorrect APIs. MLflow’s GenAI evaluation interfaces have evolved rapidly, and examples from older blog posts or Q&A sites are frequently wrong.
To prevent this, the framework uses Anthropic’s Skills mechanism—a way to provide agents with verified, task-specific knowledge before they write any code (see Anthropic’s announcement: “Equipping agents for the real world with Agent Skills”). Skills act as a constrained reference layer, ensuring the agent follows known-correct patterns rather than relying on training-time assumptions.
The framework loads Skills files containing verified MLflow 3.1+ interfaces before generating any evaluation artifacts (see the GitHub repository for more details on Skills).
From .claude/skills/mlflow-evaluation/references/GOTCHAS.md:
| Common Mistake | Correct Pattern |
|---|---|
| mlflow.evaluate() | mlflow.genai.evaluate() |
| {"query": "..."} (flat) | {"inputs": {"query": "..."}} (nested) |
| def predict_fn(inputs): | def predict_fn(**inputs): (unpacked kwargs) |
| Guidelines(guidelines="...") | Guidelines(name="...", guidelines="...") |
| from mlflow.metrics | from mlflow.genai.scorers |
From CRITICAL-interfaces.md:
```python
# Scorer function signature (verified for MLflow 3.1+)
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback

@scorer
def my_scorer(
    inputs: dict,         # What was sent to the agent
    outputs: dict,        # What the agent returned
    expectations: dict    # Ground truth (optional)
) -> Feedback | bool | int | float:
    return Feedback(value=True, rationale="Explanation")
```
By loading these Skills up front, the agent generates evaluation code that matches the current MLflow API, not outdated patterns from its training data.
Without verified interfaces, generated code often fails on first run: calling mlflow.evaluate() instead of mlflow.genai.evaluate(), passing flat input dictionaries instead of the nested schema, using the wrong predict_fn or scorer signatures, or importing from mlflow.metrics rather than mlflow.genai.scorers.
The Skills system catches these at generation time, not runtime.
This framework is designed to run locally, interactively in a notebook, or as a Databricks Job. Full setup instructions, configuration details, and examples are documented in the GitHub repository:
GitHub: https://github.com/alexmillerdb/mlflow-eval-agent
The generated evaluation can be run directly against your Databricks workspace, with results logged back to MLflow for tracking and comparison.
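For example, once a few evaluation runs have been logged, regressions can be caught by comparing the latest run against the previous one. This is a sketch only; the metric key follows the scorer names shown earlier and may differ in your setup:

```python
import mlflow

# Assumes the evaluation experiment is already the active MLflow experiment.
runs = mlflow.search_runs(order_by=["start_time DESC"], max_results=2)
metric = "metrics.correct_routing/mean"   # metric key from the generated scorers

if len(runs) == 2 and metric in runs.columns:
    latest, previous = runs.iloc[0][metric], runs.iloc[1][metric]
    if latest < previous:
        print(f"Regression: {metric} dropped from {previous:.2f} to {latest:.2f}")
```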
For step-by-step instructions—local development, Databricks deployment, and job configuration—refer to the repository README.
| Aspect | Manual Approach | This Framework |
|---|---|---|
| Time to first evaluation suite | 1-3 days | < 2 hours |
| Test cases derived from production | Rarely done | Standard |
| Scorers that compile on first run | ~60% | >95% |
| Coverage of actual failure modes | Ad-hoc | Systematic |
The value isn’t just speed—it’s a repeatable signal. By removing the front-loaded, manual work that makes evaluation hard to sustain, MLflow traces can reliably feed evaluation, evaluation can surface real gaps and regressions, and engineers can focus on deciding what to fix or improve next.
By turning evaluation into trace-driven infrastructure rather than an ad-hoc task, this framework makes systematic measurement of agent behavior practical in production, using MLflow and Databricks primitives teams already rely on.