Databricks Community

pulkitpareek

It’s Monday Morning. Something’s Wrong.

It’s 7:15 AM. The VP of Manufacturing at a global pharmaceutical company opens her operations dashboard before her leadership meeting. Five manufacturing sites. Twelve active product lines. Thousands of data points flowing in from production records, quality systems, environmental monitors, and issue logs.

Let’s walk through a scenario modeled on the kinds of manufacturing challenges we see across Databricks customers.

She flagged a problem: a sharp yield drop at the South San Francisco plant, building over two months, with quality issues spiking and no obvious explanation. Since then, her team has been investigating through cross-functional meetings and manual data pulls, forming and testing theories. They know what is happening and roughly where. They can’t figure out why.

This is where most investigations stall. Root cause analysis for a complex quality problem can take weeks, and the cost compounds every day the cause remains unknown.

But in this scenario, the VP has Databricks AI/BI Genie with Agent mode. In under four minutes, she’s about to complete an investigation that would normally take her team weeks.

The Gap Between “What” and “Why”

Business intelligence has gotten very good at answering what happened. Dashboards show trends and flag problems. But for operations leaders, knowing what is only the beginning. The real value, and the real difficulty, lies in answering why.

These “why” questions have traditionally required a human analyst to come up with theories, test them against data, and iterate, often over weeks. The data usually exists. The challenge isn’t access. It’s that answering “why” requires exploring dozens of factors in combination, finding connections no single query reveals. That’s an investigation challenge, not a data challenge.

This is exactly the gap that AI/BI Genie with Agent mode is designed to close.

A Quick Guide to the Metrics

You’ll see these five metrics throughout the story. Here’s what they mean:

Metric	What It Means	Why It Matters
Yield (%)	The percentage of production that passes quality testing.	Every 1% drop means millions in lost revenue.
RFT (%)	Right-First-Time - batches that pass quality checks on the first attempt.	When it drops, batches need rework: delays, extra cost, production backups.
OEE (%)	Overall Equipment Effectiveness - is the equipment running, at speed, and producing good output?	Tells you if the problem is the machines or the process.
Deviations	Documented records of anything that went wrong during production.	Unresolved deviations are a compliance risk. In pharma, they can trigger regulatory action.
Cost of Quality ($)	Total financial impact of quality problems - wasted material, investigation labor, retesting.	The number that gets the CFO’s attention.
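
To make the first three metrics concrete, here is a minimal Python sketch of how they could be computed. The `Batch` record and its field names are hypothetical, and measuring yield by weight is an assumption; the OEE function uses the standard availability × performance × quality definition, which matches the "running, at speed, and producing good output" description above. The sample numbers are illustrative and not taken from the scenario.

```python
from dataclasses import dataclass


@dataclass
class Batch:
    """Hypothetical batch record; field names are assumptions."""
    input_kg: float          # material charged into the batch
    passed_kg: float         # output that passed quality testing
    passed_first_time: bool  # True if no rework was needed


def yield_pct(batches):
    """Percentage of production (by weight, an assumption) that passes testing."""
    total_in = sum(b.input_kg for b in batches)
    total_ok = sum(b.passed_kg for b in batches)
    return 100.0 * total_ok / total_in


def rft_pct(batches):
    """Percentage of batches that pass quality checks on the first attempt."""
    first_time = sum(1 for b in batches if b.passed_first_time)
    return 100.0 * first_time / len(batches)


def oee_pct(availability, performance, quality):
    """Standard OEE: the product of three ratios, each between 0 and 1."""
    return 100.0 * availability * performance * quality


# Illustrative numbers only.
batches = [
    Batch(input_kg=1000, passed_kg=980, passed_first_time=True),
    Batch(input_kg=1000, passed_kg=950, passed_first_time=False),
]
print(round(yield_pct(batches), 2))  # 96.5
print(round(rft_pct(batches), 2))    # 50.0
print(round(oee_pct(0.90, 0.95, 0.98), 2))
```

Because OEE multiplies three ratios, a line can post a healthy OEE while yield falls for reasons outside the equipment, which is the pattern the dashboard shows in this story.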

Phase 1: Seeing the Problem

The VP opens her AI/BI Dashboard. It pulls from curated data tables in Unity Catalog, all governed by the same access controls that power the rest of the platform.

The top-line numbers look reassuring:

Network yield: 97.33%.
Equipment efficiency: 80.56%, well within target.
Right-first-time quality: 99.78%.

But the daily yield trend tells a different story. Every site holds steady around 97–98%, except South San Francisco (SSF), which drops sharply starting in early October, dipping below 96% through November.

SSF is the highest-volume facility in the network, so even a small yield drop carries outsized impact. Equipment efficiency shows all sites tracking steadily, SSF included. This isn’t an equipment problem. Something else is going on.

The dashboard did its job: it showed something was wrong. But it can’t explain why.

Phase 2: Isolating the Problem

She clicks “Ask Genie”, a conversational interface connected to the same governed data. No SQL. No exports. Just natural language.

Question 1: “Which site-product combinations contributed the most to yield loss in October and November 2025?”

SSF ATX-200 Tablet dominates: 39,619 kg of yield loss, nearly three times the next highest. The dashboard showed a pattern; Genie put a number on it.

Question 2: “Compare SSF ATX-200 versus Cork ATX-200 and SSF ACY-150 on yield and deviations.”

Same product at a different site, different product at the same site. The results are stark:

Line	Yield	Deviations
SSF ATX-200	94.67%	939
Cork ATX-200 (same product)	97.20%	161
SSF ACY-150 (same site)	97.61%	78

Key Insight: This is not a site-wide problem: the capsule line in the same facility is fine. It is not a product-wide problem either: Cork’s ATX-200 is fine. Something changed specifically in how SSF manufactures ATX-200 tablets.
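
The reasoning behind that insight is a two-control comparison, and it can be sketched in a few lines of Python. The function name and the 1.0 percentage-point gap threshold are assumptions chosen for illustration; the yields come from the comparison table above.

```python
def localize(problem_yield, same_product_elsewhere, same_site_other_product,
             gap_pp=1.0):
    """Classify a yield problem using two control lines.

    All yields are percentages. gap_pp is a hypothetical threshold: a control
    counts as healthy if it beats the problem line by at least this many
    percentage points.
    """
    product_also_low = same_product_elsewhere - problem_yield < gap_pp
    site_also_low = same_site_other_product - problem_yield < gap_pp
    if product_also_low and site_also_low:
        return "unclear: both controls are also low"
    if product_also_low:
        return "product-wide"
    if site_also_low:
        return "site-wide"
    return "specific to this site-product combination"


# SSF ATX-200 vs Cork ATX-200 (same product) and SSF ACY-150 (same site)
print(localize(94.67, 97.20, 97.61))
# specific to this site-product combination
```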

Question 3: “What is the estimated cost impact?”

Genie calculates: $2.2 million in incremental costs above normal, with scrap alone at nearly $4 million. But the real alarm isn’t the dollar figure: it’s the unknown root cause and the growing backlog of unresolved quality issues.

In just three questions, she knows the what, where, when, and how much. Genie answered her questions accurately, but it hasn’t answered why this is happening. By design, standard Genie chat produces a single query and output table at a time. What she ultimately needs is a comprehensive investigation, not a data table. Answering why often requires multiple levels of analysis and exploring questions nobody has asked directly. Enter Genie Agent mode.

Phase 3: Answering “Why”

She clicks Agent mode and asks the same question. What happens next is fundamentally different.

“Investigate the root cause of the SSF ATX-200 yield decline starting in October 2025. Analyze raw materials, process parameters, environmental conditions, and any site-specific changes.”

Agent mode doesn’t run a single query. It builds an investigation plan: theories to test, data to examine, steps to follow. It queries across manufacturing, materials, environmental, and quality data, learns from each result, and assembles a comprehensive, cited report.

A note on Agent mode: This story showcases a complex “why” question, but Agent mode uses the same multi-step reasoning on everyday questions too. The agentic approach makes all answers more reliable, not just the complex ones.

What Agent Mode Discovered

In under four minutes, Agent mode delivered a comprehensive, cited report. Here’s the story it pieced together:

1. The Trigger: A Supplier Change Nobody Connected to the Problem

In late September, SSF approved a routine change to start using a secondary supplier, Vendor B, for a key tablet ingredient called microcrystalline cellulose (MCC), the main filler that holds a tablet together. The change was classified as “Low Risk” with “no expected impact on product quality.”

But a supply constraint from the primary vendor forced SSF to ramp Vendor B faster than planned: from 5.5% to nearly 58% of the ingredient supply. The qualification testing had only validated performance at less than 10% usage. Nobody tested what would happen at production-scale volumes. The yield drop started within days.

2. The Mechanism: A Small Material Difference with Big Consequences

Vendor B’s material absorbed more moisture during a critical mixing step. On average, the shift looked modest. But the worst batches were far outside normal range: the upper end of moisture readings jumped 14% above baseline peaks, pushing individual batches beyond the process’s validated limits. That extra moisture threw off the tablet-pressing step downstream. Tablets started coming out too soft, triggering quality failures, rework, and scrap.

3. The Amplifier: A Humidity Problem That Genie Missed Entirely

During the same period, SSF’s manufacturing rooms experienced 74 humidity excursions: times when room humidity exceeded safe limits. During the baseline period? Zero. The elevated humidity made Vendor B’s moisture-sensitive material even harder to control, compounding the problem.

Genie’s single-pass analysis reported “no qualified humidity excursions.” Finding this connection required querying environmental monitoring data separately and correlating it with the production timeline: exactly the kind of multi-step investigation Agent mode does automatically.
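
As a rough illustration of what "correlating with the production timeline" means, the sketch below counts humidity excursions inside a baseline window and inside the incident window. It is not how Agent mode works internally. The humidity limit, the baseline window, and the sample readings are all hypothetical; only the incident dates (October 3 to November 24, taken to be in 2025 as in the questions above) come from the story.

```python
from datetime import date

HUMIDITY_LIMIT_PCT = 50.0  # hypothetical limit; the real threshold is not stated

INCIDENT_START = date(2025, 10, 3)
INCIDENT_END = date(2025, 11, 24)


def count_excursions(readings, start, end, limit=HUMIDITY_LIMIT_PCT):
    """Count readings above the limit that fall within [start, end]."""
    return sum(1 for day, rh in readings if start <= day <= end and rh > limit)


# Made-up (date, relative humidity %) readings for illustration only.
readings = [
    (date(2025, 9, 10), 42.0),
    (date(2025, 9, 20), 44.5),
    (date(2025, 10, 5), 55.1),
    (date(2025, 10, 18), 47.0),
    (date(2025, 11, 2), 58.3),
]

baseline = count_excursions(readings, date(2025, 9, 1), date(2025, 10, 2))
incident = count_excursions(readings, INCIDENT_START, INCIDENT_END)
print(baseline, incident)  # 0 2
```

The point of the split is the comparison: an excursion count only becomes meaningful once it is set against the same measure for a period when yield was normal.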

4. The Impact: 53 Days, $2 Million, and 878 Quality Issues Opened

Agent mode mapped the full arc: a 53-day incident from October 3 to November 24. Yield dropped from 97.17% to 94.55%. Quality issues surged from 4–5 per day to 15–25, with 878 opened and only 475 closed, creating a backlog that overwhelmed the quality team. Incremental cost: $2.0 million, driven by $3.5 million in scrapped production.

Meanwhile, the control lines confirmed this was isolated: Cork’s ATX-200 held at 97.20% and SSF’s capsule line ran at 97.61%. Agent mode traced it all back to a single change control approved in September.

Critically, all substandard batches were caught and contained, rejected or reworked before any impacted product reached patients. In pharmaceutical manufacturing, that’s the first question everyone asks. Agent mode’s cited report gave the quality team a clear, auditable answer in minutes.

Why This Matters: Every finding is traceable to specific investigation steps and queries. The VP can click through to verify any conclusion, and if Agent mode surfaces a correlation that doesn’t hold up under scrutiny, the citations make it fast to identify and discard. This isn’t a black box: it’s an auditable investigation that completes in minutes.

Genie found the headline. Agent mode found the full story. Both identified the vendor shift, and Genie alone answers in minutes questions that often take teams weeks to work through manually. Agent mode went further: it discovered 74 humidity excursions, traced the causal chain, quantified the financial impact, and offered corrective recommendations. All in under four minutes.

Agent Mode Findings at a Glance

For the detail-oriented reader, here’s the full baseline-versus-incident comparison from Agent mode’s report:

Metric	Baseline	During Incident
SSF ATX-200 Yield	97.17%	94.55% (−2.62 pp)
Vendor B Ingredient Share	5.5%	57.9% (peak 75%)
Granulation Moisture - Avg	2.56%	2.80%
Granulation Moisture - Worst 5% of Batches	2.78%	3.18% (+14%)
Tablet Hardness Score (avg)	11.47	10.40 (tablets too soft)
Hardness Failure Rate	Near zero	4.7%
Humidity Excursions	0	74 (1.4/day)
Quality Deviations Opened	4–5/day	15–25/day (878 total)
Incremental Cost of Quality	—	$2.0M
Cork ATX-200 (control)	97.20%	97.20% (stable)
SSF ACY-150 (control)	97.61%	97.61% (stable)
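
The derived figures in this table can be checked with simple arithmetic. The short Python sketch below recomputes them from the reported values; the helper names are illustrative, and the year 2025 is taken from the questions asked earlier in the story.

```python
from datetime import date


def pp_change(baseline, incident):
    """Change in percentage points between two percentages."""
    return round(incident - baseline, 2)


def rel_change_pct(baseline, incident):
    """Relative change, as a percentage of the baseline value."""
    return round(100.0 * (incident - baseline) / baseline, 1)


# Incident window, counted inclusively.
incident_days = (date(2025, 11, 24) - date(2025, 10, 3)).days + 1

print(incident_days)                 # 53
print(pp_change(97.17, 94.55))       # -2.62 (yield, percentage points)
print(rel_change_pct(2.78, 3.18))    # 14.4  (worst-5% moisture, reported as +14%)
print(rel_change_pct(11.47, 10.40))  # -9.3  (average tablet hardness)
print(round(74 / incident_days, 1))  # 1.4   (humidity excursions per day)
print(878 - 475)                     # 403   (deviations opened minus closed)
```

Note the two different kinds of change: the yield drop is expressed in percentage points, while the moisture shift is a relative change against the baseline reading.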

The Resolution: From Root Cause to Recovery

With the root cause clear, the team acted fast:

Tightened incoming material specs (November 25): added particle size and moisture requirements for the ingredient, and adjusted the manufacturing process to be less sensitive to supplier differences.

Capped new supplier volumes (December 1): new suppliers now limited to 30% of monthly volume until a 90-day performance review. Updated procedures and retrained staff.
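
A rule like the volume cap is simple enough to express as an automated check. The sketch below is one way it might look, assuming the cap is measured as a supplier’s share of total monthly volume and lifts once the performance review is passed; the function and parameter names are hypothetical.

```python
def exceeds_new_supplier_cap(supplier_kg, total_kg, review_passed,
                             cap_share=0.30):
    """Return True if a new supplier is over its monthly volume cap.

    Assumes the cap applies only until the supplier has passed its
    performance review (90 days in the story).
    """
    if review_passed:
        return False
    return supplier_kg / total_kg > cap_share


# Shares from the story: 5.5% at baseline, 57.9% during the incident.
print(exceeds_new_supplier_cap(5.5, 100.0, review_passed=False))   # False
print(exceeds_new_supplier_cap(57.9, 100.0, review_passed=False))  # True
print(exceeds_new_supplier_cap(57.9, 100.0, review_passed=True))   # False
```

Had a check like this been in place, the ramp to nearly 58% would have been flagged before production moved beyond what qualification testing had validated.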

Yield recovered to 96.45%, humidity excursions returned to zero, and right-first-time rate recovered to 99.9%. Performance stabilized and continued trending toward pre-incident levels.

Agent mode didn’t replace the formal quality process: corrective actions still went through proper review and approval. What it changed was the starting point. Instead of spending weeks building hypotheses manually, the quality team had a cited investigation to work from within minutes, giving them a head start of weeks on getting the problem fixed.

The causal chain: New supplier material (qualified at <10%, deployed at 58% due to supply constraints) → moisture absorption shift → tablet failures → compounded by 74 humidity excursions → yield collapse and $2.0M in costs. Traced and cited in under four minutes.

The Three Phases

What makes this compelling isn’t any single capability. It’s the progression:

	Dashboard	Genie Chat	Genie Agent Mode
Question	What happened?	What exactly? How much?	Why did it happen?
Interaction	Visual, pre-built	Conversational, ad hoc	Investigative, multi-step
Scope	Fixed views	Flexible queries	Connections across datasets
Output	Charts and metrics	Tables and summaries	Cited research reports
Human effort	Read and interpret	Ask the right questions	Ask one question, get an investigation

All three phases share the same foundation: governed data in Unity Catalog, seamless transitions between seeing a problem, exploring it, and understanding it. One system.

Why This Matters

This isn’t just a pharmaceutical story. Every operations-intensive industry (automotive, semiconductor, food & beverage, energy) faces problems whose root causes involve hidden connections across multiple data sources. The data almost always exists. The question is whether your team can explore all the possible explanations fast enough.

When root cause analysis takes minutes instead of weeks, two things change:

Problems get solved faster. Every week of undiagnosed yield loss costs more.
More problems get investigated. When investigation takes weeks, only the biggest fires get attention. When it takes minutes, smaller issues that used to be ignored now get answers.

The Bottom Line: Every company has the data to answer its toughest operational questions. The real unlock isn’t just AI: it’s bringing that data together into a governed platform where AI can actually reach it. Once it’s there, the investigation that used to take your best analysts weeks can happen in minutes.

Minutes, not weeks. Natural language, not SQL. Cited findings, not hunches. Powered by Databricks Genie.