Databricks Community

Daniel-Liden · 06-01-2026

California beekeepers lost 21% of their honey bee colonies in the first quarter of 2024, the worst quarter for the state in at least a decade, according to the United States Department of Agriculture National Agricultural Statistics Service (USDA NASS). The data on what drove these losses lives in USDA data tables. The guidance on what to do about it lives in PDFs from USDA, university extension programs, and conservation groups.

Many useful questions, in many domains, need both structured and unstructured data to answer. A natural follow-up to the colony loss statistic above is one of them: "Which stressors were most associated with those losses, and what are the recommended management practices for them?"

For an AI agent to answer a question like this, it needs to query the structured stressor data, search the guidance documents, and use what it finds in one to decide what to ask about the other. Critically, this is not simply retrieval over multiple sources. This kind of AI system needs to be smart enough to decompose the question and query the right data sources with the right sub-questions at the right time.

This is an example of grounded reasoning, a common and challenging pattern in enterprise and research settings. The Databricks AI research team has optimized the Databricks Supervisor Agent for this kind of task.

In this post, we walk through a Databricks Supervisor Agent that does this end-to-end, coordinating a Genie Space over the USDA tables and a Knowledge Assistant over the beekeeping documents. We've published the full setup as the bee colony health demo.

Bee colony health

Perhaps best known for producing honey, honeybees are also among the most significant managed pollinators in the world and are critical to the production of many kinds of crops. It is therefore important to monitor honeybee populations, seasonal losses, and the stressors that result in those losses.

A researcher or policymaker looking at this season's losses needs both the USDA tables (which stressors, by state and quarter) and the extension PDFs (treatment protocols, conservation programs), and the natural follow-up questions tend to span both.

The demo runs on real, public-domain data:

Three USDA NASS tables (~13,500 rows, 2015–2025) covering honey production, colony loss, and colony stressors.
Four PDFs (~140 pages) on varroa management, pollinator conservation, agricultural habitat, and native plants.

The CSV snapshots and PDFs are in the repo. All setup runs through a Declarative Automation Bundle (DAB); detailed instructions are in the README. The DAB creates three Delta tables from the CSVs, uploads the four PDFs to a Unity Catalog Volume, and creates the supervisor agent end-to-end with its two specialist sub-agents:

A Genie Space for getting insights from the structured data using natural language.
A Knowledge Assistant for Q&A over the unstructured documents.

Configuring the agents

We give the Genie Space a short description of each table:

honey_production — State-level honey production data (2015-2025)
   - Columns: state, year, production, yield_per_colony, colonies, price_per_lb
   - Use for: production trends, yield analysis, colony counts, pricing

colony_loss — Colony deadout loss data (2015-2025)
   - Columns: state, year, quarter, loss_pct, loss_colonies
   - Use for: loss trends by state and quarter, identifying high-loss regions

colony_stressors — Colony stressor data (2015-2025)
   - Columns: state, year, quarter, stressor, pct_affected
   - Stressors: Varroa Mites, Pesticides, Disease, Pests, Other, Unknown
   - Use for: identifying which stressors drive colony loss, seasonal patterns

Genie generates SQL code from natural language queries against these tables and runs it. A Genie session keeps conversational context, so a follow-up like "now break that out by quarter" runs against the previous result. Genie returns up to 5,000 rows per query; past that you need to refine the question.
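To make the text-to-SQL step concrete, here is a minimal sketch of the kind of query Genie might generate for "which stressor affected the most colonies in each quarter, for one state and year?". It runs against an in-memory SQLite table that mirrors the colony_stressors columns described above. The SQL is an assumption written for illustration, not Genie's actual output; the function name is hypothetical, the integer encoding of quarter is assumed, and the rows are invented placeholder values rather than USDA figures.

```python
import sqlite3

# Placeholder rows only: invented for illustration, not USDA data.
ROWS = [
    ("California", 2023, 1, "Varroa Mites", 40.0),
    ("California", 2023, 1, "Pesticides", 10.0),
    ("California", 2023, 2, "Varroa Mites", 35.0),
    ("California", 2023, 2, "Disease", 5.0),
    ("Texas", 2023, 1, "Pests", 50.0),
]

# Hypothetical query: the top-ranked stressor per quarter for one state and year.
TOP_STRESSOR_SQL = """
SELECT quarter, stressor, pct_affected
FROM (
    SELECT quarter, stressor, pct_affected,
           ROW_NUMBER() OVER (
               PARTITION BY quarter ORDER BY pct_affected DESC
           ) AS rank_in_quarter
    FROM colony_stressors
    WHERE state = ? AND year = ?
)
WHERE rank_in_quarter = 1
ORDER BY quarter
"""


def top_stressor_by_quarter(state, year):
    """Load the placeholder rows and return (quarter, stressor, pct) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE colony_stressors ("
        "state TEXT, year INTEGER, quarter INTEGER, "
        "stressor TEXT, pct_affected REAL)"
    )
    conn.executemany("INSERT INTO colony_stressors VALUES (?, ?, ?, ?, ?)", ROWS)
    result = conn.execute(TOP_STRESSOR_SQL, (state, year)).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in top_stressor_by_quarter("California", 2023):
        print(row)
```

A follow-up such as "now break that out by quarter" would, in this picture, amount to a second query of the same shape with a different grouping; in Genie the session context carries that over for you.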

The Knowledge Assistant gets a similar description for each PDF. The supervisor agent gets the two sub-agents plus a few sentences of routing guidance:

Data/Statistics → Genie Space
   Questions about honey production, colony counts, loss rates, stressors.

Guidance/Best Practices → Knowledge Assistant
   Questions about varroa management, treatment protocols, USDA programs,
   habitat creation, native plants.

Combined → Both
   Questions that need both data context and expert guidance. Use data to
   establish context, then documents for actionable recommendations.

As you use the multi-agent system, you'll find places it gets things wrong. To improve it, you can add example questions and answers, add detail to the data or document descriptions, or update the routing instructions. You don't write a router, set up a vector store, or maintain a text-to-SQL pipeline yourself. You iterate on prose: descriptions, instructions, and examples.

Grounded reasoning in practice

To demonstrate grounded reasoning in action, let's work through a concrete version of the question we raised at the beginning:

What was the main cause of colony collapse in CA in 2023 and how do I address it?

The question names no specific cause of colony collapse, so the second part cannot be turned into a useful knowledge-base query until the first part has been answered. This is where the supervisor agent comes in. It decomposes the query and routes the right questions to the right agents at the right time.

First, it sends a rewritten question to the Genie Space:

genie_query: What were the stressors affecting California bee colonies in 2023? Show the stressor data by quarter for California in 2023.

Genie runs SQL against the colony_stressors table and returns the rows and a visualization.

Varroa mites are by far the dominant stressor across the 2023 data. The supervisor reads the results of the Genie query, determines that it needs guidance about varroa mites, and writes a second question, this one to the Knowledge Assistant:

ka_query: What are the recommended varroa mite management and treatment protocols for California beekeepers? What are the best practices for controlling varroa mites?

Note that the user never used the word "varroa". The supervisor learned that word from Genie's data and used it to ask a more specific question of the Knowledge Assistant. The Knowledge Assistant retrieves passages from the varroa management guide, and the supervisor combines the SQL rows and the document passages into a final answer.
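The chaining described above can be sketched in a few lines of plain Python. This is not how the supervisor agent is implemented; it is a toy illustration of the pattern, in which the structured result supplies the search term for the document query. Both sub-agent functions are hypothetical stubs, the returned rows are invented placeholder values, and summing pct_affected across quarters is a deliberately crude way to pick a "dominant" stressor.

```python
from collections import defaultdict


def fake_genie(question):
    """Hypothetical stand-in for the Genie Space: returns stressor rows."""
    # Placeholder values, invented for illustration (not USDA data).
    return [
        {"quarter": 1, "stressor": "Varroa Mites", "pct_affected": 40.0},
        {"quarter": 1, "stressor": "Pesticides", "pct_affected": 10.0},
        {"quarter": 2, "stressor": "Varroa Mites", "pct_affected": 35.0},
        {"quarter": 2, "stressor": "Disease", "pct_affected": 5.0},
    ]


def fake_knowledge_assistant(question):
    """Hypothetical stand-in for the Knowledge Assistant: echoes the query."""
    return f"[passages retrieved for: {question}]"


def dominant_stressor(rows):
    """Pick the stressor with the highest total pct_affected across rows."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["stressor"]] += row["pct_affected"]
    return max(totals, key=totals.get)


def answer(state, year, ask_genie=fake_genie, ask_ka=fake_knowledge_assistant):
    # Step 1: the structured sub-question goes to the data source.
    genie_query = f"What were the stressors affecting {state} bee colonies in {year}?"
    rows = ask_genie(genie_query)

    # Step 2: the result of step 1 supplies a term the user never typed.
    stressor = dominant_stressor(rows)
    ka_query = (
        f"What are the recommended management practices for {stressor.lower()}?"
    )
    passages = ask_ka(ka_query)

    # Step 3: both sources are returned together for the final answer.
    return {"stressor": stressor, "ka_query": ka_query, "passages": passages}


if __name__ == "__main__":
    print(answer("California", 2023))
```

The point of the sketch is the data dependency between steps 1 and 2: the document query cannot be written until the table query has returned, which is why simple parallel retrieval over both sources falls short on questions like this one.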

[Figure: timeline of that trace in MLflow]

This kind of chained handling is what Databricks Research benchmarked on STaRK and KARLBench, two suites of questions designed to span structured and unstructured sources. The Databricks supervisor agent significantly improves retrieval and answer quality over more naive retrieval approaches, even when those approaches use state-of-the-art models.

Observing and Evaluating Multi-Agent Supervisor Behavior with MLflow

All queries to the agent system are logged as MLflow traces. You can audit the supervisor's routing decisions, the SQL Genie generated, the documents the Knowledge Assistant retrieved, the final answer, and per-step latency.

When a query returns a bad answer, you open the trace and look at where it went wrong:

Wrong sub-agent. Tighten the routing instructions, or add an example question that disambiguates.
Wrong SQL. Sharpen the table description in the Genie Space, or add an example Q&A pair using the right column.
Wrong document. Improve the document description in the Knowledge Assistant, or split a document that mixes topics.

This works well for debugging one trace at a time. But once you have a set of test questions, you also want a way to ask the same diagnostic questions across many runs: did the supervisor route correctly, did it answer with the expected facts, did it cover all parts of the question, and was the final response useful?

MLflow supports this kind of evaluation by letting you register trace-aware LLM judges with your MLflow experiments. LLM judges can parse the individual components of a trace, not just the final answer, which makes them well suited to evaluating the intermediate steps of an agent invocation.

This demo includes a companion notebook that runs a small evaluation set through the supervisor, captures each response as an MLflow trace, and scores those traces along four dimensions:

Scorers/Judges	What it measures
Routing Correctness	Did the supervisor route to the correct sub-agent(s)?
Answer Correctness	Does the response contain the expected facts?
Completeness	Does the response cover all expected elements?
Response Quality	Does the response meet domain quality standards?

These scores are attached to each trace, and the judges evaluate the whole trace, not just the final input and output. They can inspect the sub-agent routing, the Genie-generated SQL, and the sources retrieved by the Knowledge Assistant, and use this information to score each trace. The MLflow UI gives a convenient dashboard for inspecting traces, evaluation results, and rationales for scores.
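As a rough illustration of what a trace-aware check looks at, here is a deterministic stand-in for the Routing Correctness judge, written in plain Python. It is not the MLflow judge API and not the companion notebook's code: the trace is modeled as a simple dictionary with a hypothetical "spans" list and "agent" field, and the agent names are assumptions. A real LLM judge would read the actual MLflow trace and produce a similar score plus rationale.

```python
def routing_correctness(trace, expected_agents):
    """Return a pass/fail score and a rationale, mimicking a judge's output."""
    # Collect the sub-agents that were actually called in this trace.
    called = {span["agent"] for span in trace["spans"] if span.get("agent")}
    expected = set(expected_agents)
    missing = sorted(expected - called)
    unexpected = sorted(called - expected)
    passed = not missing and not unexpected
    if passed:
        rationale = "routed as expected"
    else:
        rationale = f"missing={missing}, unexpected={unexpected}"
    return {"name": "routing_correctness", "passed": passed, "rationale": rationale}


def pass_rate(results):
    """Fraction of evaluation cases that passed (0.0 for an empty list)."""
    return sum(r["passed"] for r in results) / len(results) if results else 0.0


# Hypothetical evaluation cases: (trace, expected sub-agents).
CASES = [
    (
        {"spans": [{"agent": "genie"}, {"agent": "knowledge_assistant"}]},
        ["genie", "knowledge_assistant"],
    ),
    (
        {"spans": [{"agent": "knowledge_assistant"}]},
        ["genie", "knowledge_assistant"],
    ),
]
RESULTS = [routing_correctness(trace, expected) for trace, expected in CASES]

if __name__ == "__main__":
    for result in RESULTS:
        print(result)
    print("pass rate:", pass_rate(RESULTS))
```

Keeping a rationale next to each score matters as much as the score itself: when the pass rate drops, the rationale tells you whether to revisit the routing instructions or the sub-agent descriptions.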

Other Retrieval Approaches

There are several other ways to handle hybrid retrieval. Each one makes a different tradeoff.

Approach	Best for…	Tradeoffs
Databricks supervisor agent (Genie + Knowledge Assistant + supervisor)	Hybrid structured/unstructured sources where you also want tracing, evaluation, scaling, and governance handled	Less precise control over orchestration; you're trusting the supervisor to interpret your prose correctly
Single retrieval (vector search or text-to-SQL only)	Predictable question shapes; mostly one data type; cross-format mapping is straightforward	Loss of fidelity when extracting structure from documents; vague questions map poorly to SQL; agents struggle with data embedded in text
Build your own multi-agent system with LangGraph, CrewAI, or similar	You already have vector search, a SQL warehouse, and an agent framework you trust, plus a team to maintain them	You write the routing and chaining code yourself, plus the observability and governance plumbing the sub-agents would otherwise sit on top of
Raw filesystem read tools over data files and documents	Small, changing corpus; low query volume; a strong model and you can absorb the per-call cost	Doesn't scale; token-inefficient; you're paying for an expensive model to do structural work

For most engineers, the relevant comparison is build-your-own. With LangGraph or CrewAI you write the routing and chaining code yourself, and you wire up observability and governance from scratch. The supervisor agent has those built in, but trades away fine-grained control over orchestration. Pick the one your team is set up to maintain.

Try it on your data

The bee colony health demo is an easy starting point for your own data. Replace the USDA tables with your own Delta tables and the PDFs with your own documents, rewrite the descriptions, and the same DAB will stand up the supervisor over your sources.

Multi-Agent Supervisor for Hybrid Retrieval with Agent Bricks and MLflow
