Databricks Community

Daniel-Liden · 06-01-2026

California beekeepers lost 21% of their honey bee colonies in the first quarter of 2024, the worst quarter for the state in at least a decade, according to the United States Department of Agriculture National Agricultural Statistics Service (USDA NASS). The data on what drove these losses lives in USDA data tables. The guidance on what to do about it lives in PDFs from USDA, university extension programs, and conservation groups.

Many useful questions, across many domains, need both structured and unstructured data to answer. A natural follow-up to the colony loss statistic above is one of them: "Which stressors were most associated with those losses, and what are the recommended management practices for them?"

For an AI agent to answer a question like this, it needs to query the structured stressor data, search the guidance documents, and use what it finds in one to decide what to ask about the other. Critically, this is not simply retrieval over multiple sources. This kind of AI system needs to be smart enough to decompose the question and query the right data sources with the right sub-questions at the right time.

This is an example of grounded reasoning, a common and challenging pattern in many enterprise and research settings. The Databricks AI research team has optimized the Databricks Supervisor Agent for this kind of task.

In this post, we walk through a Databricks Supervisor Agent that does this end-to-end, coordinating a Genie Space over the USDA tables and a Knowledge Assistant over the beekeeping documents. We've published the full setup as the bee colony health demo.

Bee colony health

Perhaps best known for producing honey, honeybees are also among the most significant managed pollinators in the world and are critical to the production of many kinds of crops. It is therefore important to monitor honeybee populations, seasonal losses, and the stressors that result in those losses.

A researcher or policymaker looking at this season's losses needs both the USDA tables (which stressors, by state and quarter) and the extension PDFs (treatment protocols, conservation programs), and the natural follow-up questions tend to span both.

The demo runs on real, public-domain data:

Three USDA NASS tables (~13,500 rows, 2015–2025) covering honey production, colony loss, and colony stressors.
Four PDFs (~140 pages) on varroa management, pollinator conservation, agricultural habitat, and native plants.

The CSV snapshots and PDFs are in the repo. All setup runs through a Declarative Automation Bundle (DAB); detailed instructions are in the README. The DAB creates three Delta tables from the CSVs, uploads the four PDFs to a Unity Catalog Volume, and creates the supervisor agent end-to-end with its two specialist sub-agents:

A Genie Space for getting insights from the structured data using natural language.
A Knowledge Assistant for Q&A over the unstructured documents.

Configuring the agents

We give the Genie Space a short description of each table:

honey_production — State-level honey production data (2015-2025)
   - Columns: state, year, production, yield_per_colony, colonies, price_per_lb
   - Use for: production trends, yield analysis, colony counts, pricing

colony_loss — Colony deadout loss data (2015-2025)
   - Columns: state, year, quarter, loss_pct, loss_colonies
   - Use for: loss trends by state and quarter, identifying high-loss regions

colony_stressors — Colony stressor data (2015-2025)
   - Columns: state, year, quarter, stressor, pct_affected
   - Stressors: Varroa Mites, Pesticides, Disease, Pests, Other, Unknown
   - Use for: identifying which stressors drive colony loss, seasonal patterns

Genie generates SQL code from natural language queries against these tables and runs it. A Genie session keeps conversational context, so a follow-up like "now break that out by quarter" runs against the previous result. Genie returns up to 5,000 rows per query; past that you need to refine the question.
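To make the text-to-SQL step concrete, here is a minimal sketch of the kind of query a question like "which stressors affected California colonies in 2023?" might translate into, followed by a "break that out by quarter" follow-up. It uses Python's built-in sqlite3 module rather than Genie or Delta tables. The column names come from the colony_stressors description above; the column types, the spelling of state values, the helper function names, and every row of data are assumptions invented for illustration, not USDA figures and not Genie's actual output.

```python
import sqlite3

# Toy rows invented for illustration only; these are NOT USDA NASS figures.
# Columns follow the colony_stressors description: state, year, quarter,
# stressor, pct_affected. Types and state spellings are assumptions.
ROWS = [
    ("California", 2023, 1, "Varroa Mites", 40.0),
    ("California", 2023, 1, "Pesticides", 10.0),
    ("California", 2023, 2, "Varroa Mites", 50.0),
    ("California", 2023, 2, "Pesticides", 20.0),
    ("Texas", 2023, 1, "Varroa Mites", 30.0),
]


def build_db():
    """Create an in-memory stand-in for the colony_stressors table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE colony_stressors ("
        "state TEXT, year INTEGER, quarter INTEGER, "
        "stressor TEXT, pct_affected REAL)"
    )
    conn.executemany("INSERT INTO colony_stressors VALUES (?, ?, ?, ?, ?)", ROWS)
    return conn


def stressors_by_year(conn, state, year):
    """Average pct_affected per stressor for one state and year, highest first."""
    sql = """
        SELECT stressor, AVG(pct_affected) AS avg_pct
        FROM colony_stressors
        WHERE state = ? AND year = ?
        GROUP BY stressor
        ORDER BY avg_pct DESC
    """
    return conn.execute(sql, (state, year)).fetchall()


def stressors_by_quarter(conn, state, year):
    """The follow-up: the same filter, broken out by quarter."""
    sql = """
        SELECT quarter, stressor, pct_affected
        FROM colony_stressors
        WHERE state = ? AND year = ?
        ORDER BY quarter, pct_affected DESC
    """
    return conn.execute(sql, (state, year)).fetchall()


conn = build_db()
print(stressors_by_year(conn, "California", 2023))
print(stressors_by_quarter(conn, "California", 2023))
```

In the demo you never write these queries yourself. The sketch only shows why a good table description matters: the column names in the description are what make a query like this possible to generate.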

The Knowledge Assistant gets a similar description for each PDF. The supervisor agent gets the two sub-agents plus a few sentences of routing guidance:

Data/Statistics → Genie Space
   Questions about honey production, colony counts, loss rates, stressors.

Guidance/Best Practices → Knowledge Assistant
   Questions about varroa management, treatment protocols, USDA programs,
   habitat creation, native plants.

Combined → Both
   Questions that need both data context and expert guidance. Use data to
   establish context, then documents for actionable recommendations.

As you use the multi-agent system, you'll find places it gets things wrong. To improve it, you can add example questions and answers, add detail to the data or document descriptions, or update the routing instructions. You don't write a router, set up a vector store, or maintain a text-to-SQL pipeline yourself. You iterate on prose: descriptions, instructions, and examples.

Grounded reasoning in practice

To demonstrate grounded reasoning in action, let’s return to a question like the one we raised at the beginning:

What was the main cause of colony collapse in CA in 2023 and how do I address it?

The question doesn’t mention specific causes of colony collapse: formulating the second part of the question into a useful query for a knowledge base first requires us to answer the first part of the question. This is where the supervisor agent comes in. It decomposes the query and routes the right questions to the right agents at the right time.

First, it sends a rewritten question to the Genie Space:

genie_query: What were the stressors affecting California bee colonies in 2023? Show the stressor data by quarter for California in 2023.

Genie runs SQL against the colony_stressors table and returns the rows and a visualization:

Varroa mites are by far the dominant stressor across the 2023 data. The supervisor reads the results of the Genie query, determines that it needs guidance about varroa mites, and writes a second question, this one to the Knowledge Assistant:

ka_query: What are the recommended varroa mite management and treatment protocols for California beekeepers? What are the best practices for controlling varroa mites?

Note that the user never used the word "varroa" — the supervisor learned that word from Genie's data and used it to ask a more specific question of the Knowledge Assistant. The Knowledge Assistant retrieves passages from the varroa management guide and the supervisor puts the SQL rows and the document passages together into a final answer.
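The chaining pattern described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not how the Supervisor Agent is implemented: in the real system a model decides how to decompose and route the question, while here the two steps are hard-coded. The functions fake_structured_agent and fake_document_agent are hypothetical stubs standing in for the Genie Space and the Knowledge Assistant, and the numbers they return are invented.

```python
def fake_structured_agent(question):
    """Hypothetical stand-in for the Genie Space; returns invented rows."""
    return [("Varroa Mites", 45.0), ("Pesticides", 15.0), ("Disease", 5.0)]


def fake_document_agent(question):
    """Hypothetical stand-in for the Knowledge Assistant; echoes the question."""
    return ["[stub passage for: " + question + "]"]


def top_stressor(rows):
    """Pick the stressor with the highest pct_affected from (stressor, pct) rows."""
    return max(rows, key=lambda row: row[1])[0]


def answer_hybrid_question(state, year, query_structured, query_documents):
    """Chain two lookups: the structured result shapes the document query."""
    # Step 1: ask the structured source which stressors were reported.
    rows = query_structured(
        f"What were the stressors affecting {state} bee colonies in {year}?"
    )
    # Step 2: read the result and extract a term the user never typed.
    stressor = top_stressor(rows)
    # Step 3: use that term to ask a more specific question of the documents.
    doc_question = (
        f"What are the recommended management practices for {stressor.lower()}?"
    )
    passages = query_documents(doc_question)
    return {"stressor": stressor, "doc_question": doc_question, "passages": passages}


result = answer_hybrid_question(
    "California", 2023, fake_structured_agent, fake_document_agent
)
print(result["doc_question"])
```

The point of the sketch is the data dependency between steps 1 and 3: the document query cannot be written until the structured query has returned.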

Here's the timeline of that trace in MLflow:

This kind of chained handling is what Databricks Research benchmarked on STaRK and KARLBench, two suites of questions designed to span structured and unstructured sources. The Databricks Supervisor Agent significantly improves retrieval and answer quality over more naive retrieval approaches, even when those approaches use state-of-the-art models.

Observing and Evaluating Multi-Agent Supervisor Behavior with MLflow

All queries to the agent system are logged as MLflow traces. You can audit the supervisor's routing decisions, the SQL Genie generated, the documents the Knowledge Assistant retrieved, the final answer, and per-step latency.

When a query returns a bad answer, you open the trace and look at where it went wrong:

Wrong sub-agent. Tighten the routing instructions, or add an example question that disambiguates.
Wrong SQL. Sharpen the table description in the Genie Space, or add an example Q&A pair using the right column.
Wrong document. Improve the document description in the Knowledge Assistant, or split a document that mixes topics.

This works well for debugging one trace at a time. But once you have a set of test questions, you also want a way to ask the same diagnostic questions across many runs: did the supervisor route correctly, did it answer with the expected facts, did it cover all parts of the question, and was the final response useful?

MLflow supports this kind of evaluation by letting you register trace-aware LLM judges with your MLflow experiments. These judges can parse the individual components of a trace, not just the final answer, which makes them well suited to evaluating the intermediate steps of an agent invocation.

This demo includes a companion notebook that runs a small evaluation set through the supervisor, captures each response as an MLflow trace, and scores those traces along four dimensions:

Scorers/Judges	What it measures
Routing Correctness	Did the supervisor route to the correct sub-agent(s)?
Answer Correctness	Does the response contain the expected facts?
Completeness	Does the response cover all expected elements?
Response Quality	Does the response meet domain quality standards?

These scores are attached to each trace, not just the final input/output. The judges can inspect the traces to understand the sub-agent routing, the Genie-generated SQL, and the sources retrieved by the knowledge assistant, and use this information to score each trace. The MLflow UI gives a convenient dashboard for inspecting traces, evaluation results, and rationales for scores.
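To illustrate what a routing-correctness score measures across many runs, here is a dependency-free sketch. It is a deterministic stand-in, not the LLM judge from the companion notebook and not MLflow's API: the run records below are a hypothetical simplification of what a trace contains, and the agent names, field names, and example questions are assumptions made up for this example.

```python
def routing_correctness(called_agents, expected_agents):
    """True when the set of sub-agents called matches the expected set exactly."""
    return set(called_agents) == set(expected_agents)


def summarize(runs):
    """Aggregate routing correctness over a list of simplified run records."""
    total = len(runs)
    correct = sum(
        routing_correctness(run["called"], run["expected"]) for run in runs
    )
    accuracy = correct / total if total else 0.0
    return {"total": total, "correct": correct, "accuracy": accuracy}


# Hypothetical, hand-written run records; a real trace carries far more detail.
RUNS = [
    {
        "question": "Which state produced the most honey in 2020?",
        "called": ["genie"],
        "expected": ["genie"],
    },
    {
        "question": "What drove colony loss in CA and how do I address it?",
        "called": ["genie"],  # missed the document lookup
        "expected": ["genie", "knowledge_assistant"],
    },
    {
        "question": "What stressor dominated in 2023 and how is it treated?",
        "called": ["knowledge_assistant", "genie"],  # order is ignored
        "expected": ["genie", "knowledge_assistant"],
    },
    {
        "question": "How do I monitor varroa mite levels?",
        "called": ["knowledge_assistant"],
        "expected": ["knowledge_assistant"],
    },
]

print(summarize(RUNS))
```

An exact set match is a deliberately strict choice for this sketch. A judge that reads the full trace can make a more nuanced call, for example accepting an extra sub-agent call that did not harm the answer.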

Other Retrieval Approaches

There are several other ways to handle hybrid retrieval. Each one makes a different tradeoff.

Approach	Best for…	Tradeoffs
Databricks supervisor agent (Genie + Knowledge Assistant + supervisor)	Hybrid structured/unstructured sources where you also want tracing, evaluation, scaling, and governance handled	Less precise control over orchestration; you're trusting the supervisor to interpret your prose correctly
Single retrieval (vector search or text-to-SQL only)	Predictable question shapes; mostly one data type; cross-format mapping is straightforward	Loss of fidelity when extracting structure from documents; vague questions map poorly to SQL; agents struggle with data embedded in text
Build your own multi-agent system with LangGraph, CrewAI, or similar	You already have vector search, a SQL warehouse, and an agent framework you trust, plus a team to maintain them	You write the routing and chaining code yourself, plus the observability and governance plumbing the sub-agents would otherwise sit on top of
Raw filesystem read tools over data files and documents	Small, changing corpus; low query volume; a strong model and you can absorb the per-call cost	Doesn't scale; token-inefficient; you're paying for an expensive model to do structural work

For most engineers, the relevant comparison is build-your-own. With LangGraph or CrewAI you write the routing and chaining code yourself, and you wire up observability and governance from scratch. The supervisor agent has those built in, but trades away the fine-grained control over orchestration. Pick the one your team is set up to maintain.

Try it on your data

The bee colony health demo is an easy starting point for your own data. Replace the USDA tables with your own Delta tables and the PDFs with your own documents, rewrite the descriptions, and the same DAB will stand up the supervisor over your sources.

SalmanJafferCFA · 06-09-2026

has there been any benchmarking against this solution and a GenAI approach? Or was this example chosen as an example of a query space in a non-public domain?

https://share.google/aimode/Caa7JtD2YNqQH9e8F

Daniel-Liden · 06-09-2026

Hey @SalmanJafferCFA ! Can you clarify what you are asking? This example did use fully open source/public data (you can find sources in the repo).

Are you asking about the difference between using an approach like this and just asking an LLM and relying on its native search? There are a few reasons you might prefer an approach like this one:

In some cases, it's going to be cheaper to use specialist models that are highly optimized for retrieval and answering and reasoning over data.
You might be in a situation where you need to be absolutely certain that the answers are backed up by a particular canonical source, and not just what happens to be retrieved from the Internet with citations.
Relatedly, different AI models and agents have different approaches to searching and to using information from the internet and can return different answers from each other. Again, if you have canonical, trusted sources and you need to know that the answers are rooted in particular data and particular sources, then you want an approach that is retrieving from your preferred data.

As far as benchmarks go I would suggest reading:
- https://www.databricks.com/blog/meet-karl-faster-agent-enterprise-knowledge-powered-custom-rl
- https://www.databricks.com/blog/agentic-reasoning-practice-making-sense-structured-and-unstructured-...

Though these are not comparing retrieval versus model-native open web search, they are comparing how different models and agents perform on retrieval tasks.

Hope this helps!

SalmanJafferCFA · 06-09-2026

possible to provide some examples other than the bees where it is cheaper to use specialist models developed with a tool like Databricks that are highly optimized for retrieval and answering and reasoning over data as opposed to free online tools such as Gemini and ChatGPT? Isn't model-native open web search a form of a model and agent on a retrieval task? This article goes some way to answering my query thanks! 🙂

Meet KARL: A faster agent for enterprise knowledge, powered by custom RL | Databricks Blog

Daniel-Liden · 06-12-2026

Great questions! (and thanks for reading!)

Isn't model-native web search already a form of a model and agent on a retrieval task?

Certainly it is! It's less about general retrieval, though, and more about what is being retrieved. Web search retrieves from the open internet while enterprise agents retrieve from data that isn't on the web at all.

When is it cheaper or better to use a specialist model instead of free online tools like ChatGPT or Gemini?

The clearest case is proprietary data. If your organization has, for example, a collection of legal documents, customer account data, financial reports, etc., you won't be able to get this from the web & you likely would not be permitted to use free-tier models for analysis anyway, on account of their data retention policies. So proprietary documents, in general, are one such use case.

For another, you may be interested in the guidance that exists in a specific set of tables and documents. The bee example above pulled a table from the Ontario Beekeepers' Association, but maybe you work for an organization where you are required to follow guidance issued by a different organization. In that case you need retrieval grounded in that set of documents, not whatever the open web ranks highest.

Do specialist models actually perform better on complex enterprise data?

Lastly, especially for more complex data, you are likely to see better performance from a specialist model that is very good at writing SQL (Genie) in an environment where it has access to your schemas, metadata, and metrics. A general-purpose model doesn't have access to this context. Working with a generalist model and correcting its numerous false assumptions and misinterpretations can be both costly and error-prone. That's part of what the KARL article demonstrates.

Multi-Agent Supervisor for Hybrid Retrieval with Agent Bricks and MLflow