
Not able to add scorer to multi agent supervisor

shivamrai162 — Thu, 20 Nov 2025 03:58:35 GMT

Hello,

When I try to add scorers to the multi-agent endpoint based on the last 10 traces that I have logged, which are visible in the Experiments tab, I get this error.

Also, are there any demos I can refer to that explain the tabs within the evaluation bar and how they can be leveraged?

Re: Not able to add scorer to multi agent supervisor

stbjelcevic — Mon, 01 Dec 2025 23:14:44 GMT

Hi @shivamrai162,

Did you add the last 10 traces to the evaluation dataset? You can follow the steps here to make sure you added the traces to the evaluation dataset.

To answer your second question, here is a good article that covers the concepts and data model of MLflow for GenAI: https://docs.databricks.com/aws/en/mlflow3/genai/concepts/

This article also links to a few other examples that can help you better understand each of the sidebar options: https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor

I'll also include a quick summary for each of the buttons below:

Traces: Observability of captured interactions. You can export selected traces to an evaluation dataset from here.
Sessions: Conversation-level grouping and observability. Multi-turn evaluations and UI concepts are centered around session groupings.
Scorers: Where you define and manage evaluation functions that adapt traces into judge inputs (built-in or custom). Scorers extract request/response/context from traces and call LLM judges or your code.
Datasets: Curated evaluation sets built from traces, labeling sessions, synthetic data, or imports. Used as the source of truth for evaluation runs.
Evaluation runs: Executions of scorers against a dataset to produce comparable quality results across agent versions.
Labeling schemas: Structured questions (feedback and expectations) used in labeling sessions. Includes built-ins like "guidelines", "expected_facts", and "expected_response".
Labeling sessions: Queues of traces or dataset records sent to SMEs for review in the Review App. Labels become Assessments attached to traces and can be synced back to datasets.
Prompts: Version-controlled templates for LLM prompts.
Agent versions: Experiment-level tracking of the artifacts and versions you evaluate and compare in the UI.
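
To make the relationship between traces, datasets, scorers, and evaluation runs more concrete, here is a rough plain-Python sketch. It deliberately does not use the MLflow API: `Trace`, `Feedback`, `last_n_traces`, `to_dataset_records`, `run_evaluation`, and both scorer functions are hypothetical names made up for illustration, and the real objects carry much more information. The point is only the flow: traces are exported into a dataset, and an evaluation run applies each scorer to each dataset record.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-in for a logged trace (not the MLflow trace object).
@dataclass
class Trace:
    trace_id: str
    timestamp: int
    request: str
    response: str

# Hypothetical stand-in for a pass/fail assessment attached to a trace.
@dataclass
class Feedback:
    scorer: str
    trace_id: str
    passed: bool

def last_n_traces(traces: list[Trace], n: int) -> list[Trace]:
    """Return the n most recent traces, newest first."""
    return sorted(traces, key=lambda t: t.timestamp, reverse=True)[:n]

def to_dataset_records(traces: list[Trace]) -> list[dict]:
    """Export selected traces into evaluation dataset records."""
    return [
        {"trace_id": t.trace_id, "inputs": t.request, "outputs": t.response}
        for t in traces
    ]

# Two toy code-based scorers: each receives the request and response
# extracted from a record and returns True (pass) or False (fail).
def non_empty_response(inputs: str, outputs: str) -> bool:
    return bool(outputs.strip())

def concise_response(inputs: str, outputs: str) -> bool:
    return len(outputs) <= 200

def run_evaluation(
    dataset: list[dict],
    scorers: dict[str, Callable[[str, str], bool]],
) -> list[Feedback]:
    """Apply every scorer to every dataset record."""
    return [
        Feedback(name, rec["trace_id"], fn(rec["inputs"], rec["outputs"]))
        for rec in dataset
        for name, fn in scorers.items()
    ]

traces = [
    Trace("t1", 1, "What is MLflow?", "An open source platform."),
    Trace("t2", 2, "Hi", "  "),
    Trace("t3", 3, "Summarize this", "x" * 500),
]
scorers = {"non_empty": non_empty_response, "concise": concise_response}

dataset = to_dataset_records(last_n_traces(traces, 2))
results = run_evaluation(dataset, scorers)
for fb in results:
    print(fb.trace_id, fb.scorer, "pass" if fb.passed else "fail")

# In this sketch, an empty dataset simply yields no feedback at all.
empty_results = run_evaluation([], scorers)
```

In this toy model the scorers only ever see what is in the dataset, so traces that were logged but never exported into it are not evaluated.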

Re: Not able to add scorer to multi agent supervisor

shivamrai162 — Fri, 19 Dec 2025 06:11:27 GMT

Thanks a lot, that makes the workflow clear.

I have a few points I need help with:

* Since it mentioned traces, I thought it was going to refer to the last 10 traces present in the Traces section rather than the evaluation dataset.
I did have around 50 traces in the Traces section, although the evaluation dataset was empty.

* I also noticed an issue where, after defining a scorer, clicking "run scorer" in the UI with the last n traces does nothing. The Traces section also doesn't show a pass/fail check in the assessment column.

* When "Evaluating traces: ON" is shown for a scorer, it doesn't evaluate new traces that get generated.

* After defining the evaluation dataset and building it from selected traces, the Datasets section only shows the empty Delta table that was created, not the selected traces that were added.

* I also wanted to understand what the "Examples" tab (formerly "improve quality") is for, where we add questions and guidelines and then start a labeling session. Is it only for the human reviewer, or does it relate to a scorer that assesses all traces against the general questions and guidelines?

* Since this is in beta, are there any known UI-level bugs in the Experiments tab?