Hi @shivamrai162 ,
Did you add the last 10 traces to the evaluation dataset? You can follow the steps here to confirm the traces were actually added to the dataset.
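In case it's useful, here's a rough sketch of doing the same thing programmatically instead of through the UI. It assumes the MLflow 3 `mlflow.search_traces` and `mlflow.genai.datasets` APIs from the docs linked below; the experiment ID and Unity Catalog table name are placeholders, so treat it as a starting point rather than copy-paste code:

```python
import mlflow
import mlflow.genai.datasets as datasets

# Grab the 10 most recent traces from your experiment (placeholder experiment ID).
traces = mlflow.search_traces(
    experiment_ids=["<your-experiment-id>"],
    max_results=10,
    order_by=["timestamp_ms DESC"],
)

# Create an evaluation dataset backed by a Unity Catalog table (placeholder name)
# and merge the traces into it. If the table already exists, use
# datasets.get_dataset(...) instead of create_dataset(...).
eval_dataset = datasets.create_dataset(uc_table_name="main.my_schema.my_eval_dataset")
eval_dataset.merge_records(traces)
```

After that, the records should show up under the Datasets tab of the experiment.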
To answer your second question, here is a good article that covers the concepts and data model of MLflow for GenAI: https://docs.databricks.com/aws/en/mlflow3/genai/concepts/
This article also links to a few other examples that can help you better understand each of the sidebar options: https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor
I'll also include a quick summary of each of the sidebar options below:
- Traces: Observability of captured interactions. You can export selected traces to an evaluation dataset from here.
- Sessions: Conversation-level grouping and observability. Multi-turn evaluations and UI concepts are centered on session groupings.
- Scorers: Where you define and manage evaluation functions that adapt traces into judge inputs (built-in or custom). Scorers extract request/response/context from traces and call LLM judges or your code.
- Datasets: Curated evaluation sets built from traces, labeling sessions, synthetic data, or imports. Used as the source of truth for evaluation runs.
- Evaluation runs: Executions of scorers against a dataset to produce comparable quality results across agent versions (see the sketch after this list for how scorers, datasets, and evaluation runs fit together).
- Labeling schemas: Structured questions (feedback and expectations) used in labeling sessions. Includes built-ins like "guidelines", "expected_facts", and "expected_response".
- Labeling sessions: Queues of traces or dataset records sent to SMEs for review in the Review App. Labels become Assessments attached to traces and can be synced back to datasets.
- Prompts: Version-controlled templates for your LLM prompts (a quick registry sketch is also included below).
- Agent versions: Experiment-level tracking of the artifacts and versions you evaluate and compare in the UI.
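To make the relationship between scorers, datasets, and evaluation runs a bit more concrete, here's a minimal sketch using the `mlflow.genai.evaluate` API and the `@scorer` decorator described in the docs above. The scorer names, guideline text, and inline data are all made up for illustration; in practice you would pass the evaluation dataset you curated from traces instead of a hard-coded list:

```python
import mlflow
from mlflow.genai.scorers import Guidelines, scorer

# Custom scorer: a decorated function that turns evaluation outputs into a score.
# Here, a toy heuristic for "conciseness".
@scorer
def is_concise(outputs) -> bool:
    return len(str(outputs)) < 500

# Built-in LLM-judge scorer driven by plain-language guidelines.
politeness = Guidelines(name="politeness", guidelines="The response must be polite.")

# Tiny inline dataset with pre-computed outputs, just for illustration.
eval_data = [
    {
        "inputs": {"question": "How do I reset my password?"},
        "outputs": "Go to Settings > Security and click 'Reset password'.",
    },
]

# Running scorers against the dataset creates an evaluation run whose results
# you can compare across agent versions in the UI.
results = mlflow.genai.evaluate(data=eval_data, scorers=[is_concise, politeness])
```

The decorated scorer can also accept `inputs`, `expectations`, and `trace` parameters if it needs more context than just the outputs.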
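And for the Prompts tab, a quick sketch of the prompt registry flow, assuming the MLflow 3 `mlflow.genai.register_prompt` / `load_prompt` APIs (the prompt name and template are placeholders):

```python
import mlflow

# Register a new version of a prompt template (double braces mark variables).
mlflow.genai.register_prompt(
    name="support_answer",
    template="Answer the customer question politely: {{question}}",
)

# Load a specific version later and fill in the variables.
prompt = mlflow.genai.load_prompt("prompts:/support_answer/1")
text = prompt.format(question="How do I reset my password?")
```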