
Not able to add scorer to multi agent supervisor

shivamrai162
New Contributor III

Hello,

When I try to add scorers to my multi-agent supervisor endpoint based on the last 10 traces that I have logged (visible in the Experiments tab), I get this error.

shivamrai162_0-1763609354150.png

Also, are there any demos I can refer to for the tabs within the evaluation bar, explaining how each of them can be leveraged?

shivamrai162_2-1763609468060.png

 

1 REPLY

stbjelcevic
Databricks Employee

Hi @shivamrai162 ,

Did you add the last 10 traces to an evaluation dataset? You can follow the steps here to make sure the traces were added to the dataset correctly.
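
In case it helps, here is a minimal sketch of doing the same thing from a notebook, assuming MLflow 3's GenAI APIs (`mlflow.search_traces` and `mlflow.genai.datasets`). The experiment path and Unity Catalog table name are placeholders, so adjust them to your workspace and double-check the calls against the docs linked below, since these APIs have been evolving:

```python
import mlflow
from mlflow.genai.datasets import create_dataset

# Point at the experiment that holds the agent's traces (placeholder path).
mlflow.set_experiment("/Users/you@example.com/multi-agent-supervisor")

# Pull the most recent traces logged to that experiment.
traces = mlflow.search_traces(max_results=10)

# Create an evaluation dataset backed by a Unity Catalog table (placeholder
# name) and merge the traces into it; use get_dataset() instead if the
# table already exists.
eval_dataset = create_dataset(uc_table_name="main.default.supervisor_eval")
eval_dataset.merge_records(traces)
```

Once the traces are part of a dataset, the scorer flow in the UI should be able to find them.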

To answer your second question, here is a good article that covers the concepts and data model of MLflow for GenAI: https://docs.databricks.com/aws/en/mlflow3/genai/concepts/

This article also links to a few other examples that can help you better understand each of the sidebar options: https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor

I'll also include a quick summary for each of the buttons below:

  • Traces: Observability of captured interactions. You can export selected traces to an evaluation dataset from here.

  • Sessions: Conversation-level grouping and observability. Multi-turn evaluations and UI concepts are centered around session groupings.

  • Scorers: Where you define and manage evaluation functions that adapt traces into judge inputs (built-in or custom). Scorers extract request/response/context from traces and call LLM judges or your code (see the first sketch after this list).

  • Datasets: Curated evaluation sets built from traces, labeling sessions, synthetic data, or imports. Used as the source of truth for evaluation runs.

  • Evaluation runs: Executions of scorers against a dataset to produce comparable quality results across agent versions.

  • Labeling schemas: Structured questions (feedback and expectations) used in labeling sessions. Includes built-ins like "guidelines", "expected_facts", and "expected_response".

  • Labeling sessions: Queues of traces or dataset records sent to SMEs for review in the Review App. Labels become Assessments attached to traces and can be synced back to datasets.

  • Prompts: Version-controlled templates for LLM prompts, managed in the prompt registry (see the second sketch after this list).

  • Agent versions: Experiment-level tracking of the artifacts and versions you evaluate and compare in the UI.
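
To make the Scorers and Evaluation runs entries above more concrete, here is a minimal sketch of a custom scorer evaluated over the dataset from the earlier snippet. It assumes MLflow 3's `mlflow.genai` APIs and the placeholder table name `main.default.supervisor_eval`; if your dataset only stores inputs and expectations (no recorded outputs), you would also pass a `predict_fn` that calls your agent.

```python
import mlflow
from mlflow.genai.datasets import get_dataset
from mlflow.genai.scorers import scorer

# A simple custom scorer: a pass/fail check that the agent produced a response.
# `inputs` and `outputs` are pulled from each dataset record / trace.
@scorer
def non_empty_response(inputs, outputs) -> bool:
    return outputs is not None and len(str(outputs).strip()) > 0

# Load the evaluation dataset built earlier (placeholder table name).
eval_dataset = get_dataset(uc_table_name="main.default.supervisor_eval")

# Each call to evaluate() produces an evaluation run that you can inspect and
# compare across agent versions in the Experiments UI.
results = mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[non_empty_response],
)
```

Built-in LLM-judge scorers from `mlflow.genai.scorers` can be passed in the same `scorers` list alongside custom ones; the docs linked above list what's available.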
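
For the Prompts entry, here is a quick sketch of registering and loading a versioned prompt template. It assumes the MLflow 3 prompt registry API (`mlflow.genai.register_prompt` / `mlflow.genai.load_prompt`); the prompt name and template are made up, and it's worth verifying the exact API surface against the docs linked above.

```python
import mlflow

# Register a new version of a prompt; template variables use double braces.
prompt = mlflow.genai.register_prompt(
    name="supervisor_routing_prompt",  # placeholder name
    template="You are a supervisor agent. Route this request: {{request}}",
)

# Later, load that version back and fill in the template variables.
loaded = mlflow.genai.load_prompt(f"prompts:/supervisor_routing_prompt/{prompt.version}")
print(loaded.format(request="Summarize the latest support tickets."))
```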