In Part 1 of this series, we focused on parsing enterprise PDFs. In this post, we build on that work and show how those chunks are combined with structured use-case data to produce consistent, account-ready outputs using LLMs.
Large language models (LLMs) are increasingly used to generate decision-making content from a mix of unstructured documents and structured system data. In practice, these systems often break down once correctness, groundedness, and repeatability start to matter.
This post describes a production-oriented pattern for building reliable LLM pipelines across multiple sources of truth. While the concrete example is an account newsletter, the architecture generalizes to any LLM-driven synthesis where evidence, context, and output quality must be controlled.
The focus is not on model choice or prompt tricks, but on system structure: how inputs are separated, how reasoning is staged, and how constraints are enforced end-to-end.
Before talking about architecture, it’s worth defining what “useful” means for LLM-generated content.
In most enterprise environments, data is not scarce. Teams already have PDFs from workshops and reviews, CRM or Salesforce use cases, dashboards, and many ad-hoc notes. The problem is not collecting more information. It is turning scattered, noisy inputs into a short, reliable view of what is happening and what needs attention.
The AI-generated account newsletter used in this series is one concrete example of this broader problem. It represents a common LLM workload: combining multiple sources into a structured summary that highlights priorities, risks, and actions. The same pattern shows up in executive updates, operational reviews, and internal status reports.
For this class of use cases, a useful LLM output must meet three criteria.
Actionable
It should surface what matters now: key initiatives, risks, blockers, and next steps.
Trustworthy
Every statement must be traceable to a real source, such as a specific PDF page or a structured record. If something cannot be grounded, it should not be included.
Consistent
The same inputs should produce similar outputs across runs. Results should not depend on luck with prompts or manual cleanup.
These constraints shaped the pipeline design. Instead of asking an LLM to “summarize everything,” the system enforces structure, prioritization, and citation rules. The goal is not fluent text, but a reliable signal — content that can be read quickly and trusted.
This design aligns with emerging prompt-optimization approaches such as DSPy and MLflow’s Prompt Optimizer, which treat prompts and LLM programs as objects that can be evaluated, tuned, and versioned. Rather than relying on trial-and-error prompting, the system is built to be measurable and reproducible.
Enterprise documents, typically PDFs, are a rich source of information. Workshop decks, QBR slides, architecture reviews, and meeting notes often contain important decisions, risks, and technical details. In Part 1, we focused on turning these documents into structured, citeable chunks that an LLM can safely consume.
However, PDFs alone are not enough to generate useful, decision-facing content.
First, PDFs are inherently backward-looking. They capture what was discussed at a point in time, but they do not reliably reflect current status. A slide stating “target go-live in Q3” may already be outdated, delayed, or completed. Without additional context, an LLM has no way to know whether a statement is still valid.
Second, PDFs lack operational signals. They rarely contain information such as current stage, ownership, active blockers, or recent comments from account teams. These details often live in structured systems like Salesforce, tracking tools, or curated use-case records, not in presentation decks.
Third, PDFs tend to mix signal and noise. A single document may include background material, historical context, exploratory ideas, and tentative plans alongside actual decisions. Even with good chunking, an LLM cannot reliably infer which parts represent commitments versus discussion without external guidance.
This is where structured context becomes critical.
In our case, structured context is represented by use-case records coming from an ETL pipeline out of Salesforce. These records provide a different kind of information: current stage, business priority, ownership, timelines, and recent updates. They act as a source of truth for what is supposed to be happening now, while PDFs provide evidence for how and why decisions were made.
In the newsletter pipeline, unstructured PDFs are treated as evidence, not authority. Use-case data is treated as context that frames interpretation. When the two agree, PDFs add detail and credibility. When they conflict, the structured context takes priority, and the discrepancy itself may be surfaced as a risk or action item.
This pattern generalizes to most production AI systems. Structured data — such as system-of-record fields, timestamps, ownership, and status — should define what is true, while unstructured data provides why it matters. Treating unstructured inputs as authoritative leads to stale conclusions and subtle hallucinations. Treating them as supporting evidence preserves correctness while still capturing nuance and narrative depth.
This separation is intentional. Asking an LLM to reconcile stale documents and current reality on its own leads to hallucination or false confidence. By explicitly combining unstructured evidence with structured context, the system constrains the model to generate outputs that reflect both historical grounding and current state.
The rest of the pipeline builds on this idea: keep evidence and context separate, and only merge them at controlled points in the generation flow.
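To make this concrete, here is a minimal sketch of what "separate until a controlled merge point" can look like in code. The types and the `merge_for_prompt` helper are hypothetical illustrations under the assumptions above, not the actual implementation:

```python
from dataclasses import dataclass

# Hypothetical types: evidence (PDF chunks) and context (use-case records)
# live in separate structures so they can only meet at explicit merge points.
@dataclass(frozen=True)
class EvidenceChunk:
    file: str
    page: int
    text: str

@dataclass(frozen=True)
class UseCaseContext:
    name: str
    stage: str
    owner: str

def merge_for_prompt(chunk: EvidenceChunk, context: UseCaseContext) -> str:
    # The single controlled point where evidence and context are combined
    # into one prompt, with provenance attached at assembly time.
    return (
        f"PDF CHUNK [FILE: {chunk.file}][PAGE {chunk.page}]:\n{chunk.text}\n\n"
        f"USE-CASE SNAPSHOT [UC: {context.name}] "
        f"(stage={context.stage}, owner={context.owner})"
    )
```

Because evidence and context live in distinct types, the only way they can meet is through an explicit assembly function, which is easy to audit and test.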
To produce LLM output that is both useful and trustworthy, the pipeline follows a Map → Reduce → Align pattern.
This structure borrows ideas from classic distributed systems, but applies them to reasoning control, not compute scale. Instead of letting the model reason freely over large inputs, each stage has a narrow, explicit responsibility.
At a high level:
This turns the LLM from a free-form writer into a constrained reasoning engine.
The Map stage operates on small, isolated inputs:
Each input is processed independently by a structured LLM call.
The objective is not to summarize the document.
The objective is to extract atomic, source-bound statements that may be useful downstream.
Each chunk is sent to the model with:
The model must return structured JSON — not narrative prose.
Below is a simplified excerpt of actual Map-stage output before reduction:
{
  "Next Steps": [
    "RFI will be sent for Agent Governance [UC: EAI – Lighthouse Pricing Forecast]",
    "Continue onboarding critical use cases and enhance features [FILE: Q1FY26_Strategy_QBR_XYZ.pdf][PAGE 10]",
    "Address technical blockers and ensure legal terms for LLM services are resolved [UC: EAI – Lighthouse Pricing Forecast]"
  ],
  "Blockers": []
}
Each entry:
This is raw signal — not a summary.
Because provenance is attached at creation time, the model never needs to “remember” where a statement came from. Source binding is enforced early in the pipeline rather than reconstructed later.
This design dramatically reduces hallucination risk. Map is constrained to extraction, not interpretation. Prioritization and synthesis are intentionally deferred to the Reduce stage.
Once all Map outputs are collected, candidates are grouped by section intent, such as:
SECTION_KEYS = [
    "Executive Summary",
    "Value Proposition",
    "Risk Assessment",
    "Macrotrends",
    "Next Steps",
]
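Grouping itself can be a small, deterministic helper. The sketch below is an illustrative assumption of what a `collect_candidates` function (the name also appears in the aggregation snippet later in this post) might look like:

```python
def collect_candidates(map_outputs, section):
    """Gather every Map candidate for one section across all chunks,
    preserving order and dropping exact duplicates."""
    seen, out = set(), []
    for output in map_outputs:
        for bullet in output.get(section, []):
            if bullet not in seen:
                seen.add(bullet)
                out.append(bullet)
    return out
```

Keeping this step outside the LLM means grouping behavior never varies between runs.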
Each section receives only the candidates relevant to its purpose.
The Reduce step allows the LLM to:
Reduce is not allowed to invent new facts. It can only rephrase, combine, or de-emphasize existing Map outputs.
Different sections apply different reduction rules:
These behaviors are enforced through section-specific prompts, not post-processing.
The Align step is what makes the output safe to share.
Here, hard guardrails are applied:
For example:
If Reduce decides what to say, Align decides what is allowed to be said.
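As an illustration, one Align guardrail can be implemented as a pure filter over generated bullets. The regex and function below are a hedged sketch, assuming the two citation formats used throughout this pipeline and a whitelist of known sources:

```python
import re

# Allowed citation shapes: exactly one tag at the end of the bullet.
CITE_RE = re.compile(
    r"\[(?:UC: (?P<uc>[^\]]+)|FILE: (?P<file>[^\]]+)\]\[PAGE \d+)\]\s*$"
)

def align_filter(bullets, known_files, known_use_cases):
    """Align-stage guardrail: keep only bullets that end with a citation
    resolving to a known source; everything else is dropped, not repaired."""
    kept = []
    for bullet in bullets:
        m = CITE_RE.search(bullet)
        if m is None:
            continue  # uncited statements are not allowed to be said
        if m.group("uc") in known_use_cases or m.group("file") in known_files:
            kept.append(bullet)
    return kept
```

Note that the filter never rewrites a bullet: a statement either passes the guardrail intact or is removed.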
In practice, this eliminates:
Prompt engineering is not just wording. In production, prompts act like APIs: they define contracts, scope, and allowed outputs.
In this pipeline we use three prompt layers. Each layer has one job.
In practice, each Map, Reduce, and Align step is executed via Databricks Model Serving or an internal LLM endpoint; the architectural pattern is independent of the specific invocation mechanism.
The system prompt sets non-negotiables that are shared across stages:
Example (simplified from our implementation):
You are an enterprise analyst assisting an account team.
Use ONLY the provided corpus (PDF docs + Use Case notes).
Do NOT use outside knowledge. Do NOT hallucinate.
CITATIONS (MANDATORY)
- Every sentence/bullet must end with exactly one citation tag:
[UC: <usecase_name>] OR [FILE: <name>][PAGE <X>] (OCR allowed)
- If you need two sources, split into two bullets.
- Do NOT combine sources in one sentence.
This prompt is intentionally boring. It exists to prevent drift.
Map runs on a single bounded chunk of evidence (typically a few pages of a PDF, or a small slice of text). The goal is not to summarize. The goal is to extract small, cited candidate statements that can be reduced later.
Map outputs JSON keyed by section, so downstream stages are deterministic:
In our implementation, we also change the allowed citation templates depending on whether the chunk is file-backed:
Example Map prompt structure (simplified):
PDF CHUNK (3/12):
<chunk_text>
USE-CASE SNAPSHOT:
<use_case_context (truncated)>
OUTPUT
- Return a JSON object with exactly these keys: ["Executive Summary", "Risk Assessment", ...]
- Each key maps to an array of 0–3 strings
- Each string must end with exactly ONE citation tag
In practice, this format can be enforced using structured output capabilities at the model-serving layer (e.g., schema-constrained JSON responses). Rather than relying solely on prompt instructions, the serving endpoint validates that responses conform to the expected schema. This ensures deterministic structure and prevents extra keys or malformed outputs from propagating downstream.
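Where schema-constrained decoding is not available at the serving layer, the same contract can be approximated with a small validator before any output propagates. This is a stdlib-only sketch, not the serving endpoint's actual mechanism:

```python
import json

def validate_map_schema(raw: str, allowed_keys: list[str]) -> dict:
    """Reject responses that don't match the expected shape: exactly the
    allowed keys, each mapping to a list of 0-3 strings."""
    data = json.loads(raw)
    if set(data) != set(allowed_keys):
        raise ValueError(f"unexpected keys: {sorted(set(data) ^ set(allowed_keys))}")
    for key, value in data.items():
        if not isinstance(value, list) or len(value) > 3:
            raise ValueError(f"{key}: expected a list of 0-3 strings")
        if not all(isinstance(item, str) for item in value):
            raise ValueError(f"{key}: non-string entry")
    return data
```

Failing loudly here is deliberate: a malformed Map response is retried or dropped, never silently patched downstream.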
A key design choice: Map is allowed to be redundant. We prefer “too many cited candidates” over “missing signal.”
Reduce runs per section, across the candidates collected from all Map calls.
Before calling the model, we pre-filter aggressively:
Then Reduce asks the LLM to consolidate:
Reduce produces the final section text (summary paragraph + bullets), but still under strict constraints.
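The pre-filtering described above can be sketched as follows, assuming the citation conventions from earlier in this post; `normalize` is a hypothetical helper for near-duplicate detection:

```python
import re

def normalize(bullet: str) -> str:
    """Strip the citation tag and punctuation, lowercase the rest,
    so near-identical candidates collapse to the same key."""
    text = re.sub(r"\[(?:UC|FILE)[^\]]*\](?:\[PAGE \d+\])?", "", bullet)
    return re.sub(r"\W+", " ", text).strip().lower()

def prefilter(candidates: list[str]) -> list[str]:
    """Drop uncited bullets, then collapse near-duplicates (first wins)."""
    seen, kept = set(), []
    for bullet in candidates:
        if "[UC:" not in bullet and "[FILE:" not in bullet:
            continue  # uncited → dropped before Reduce ever sees it
        key = normalize(bullet)
        if key not in seen:
            seen.add(key)
            kept.append(bullet)
    return kept
```

This keeps the Reduce prompt small and ensures the model only ever consolidates grounded material.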
For document-heavy sources like QBRs or product manuals, we treat recency and version as first-class signals during pre-filtering. When multiple candidates express the same idea, we prefer statements from the latest document version (or the most recent “as-of” date) and drop older variants unless they add unique context.
Example: if a QBR from Oct 2025 and a QBR from Jan 2026 both mention the same migration risk, we keep the newer cited statement (Jan 2026) and discard the older one unless it contains details not present in the latest version. This prevents the final section from reflecting outdated status while still preserving traceability.
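A minimal sketch of this recency rule, assuming candidates have already been grouped under a shared deduplication key and tagged with an "as-of" date (names hypothetical):

```python
from datetime import date

def prefer_latest(candidates: list[tuple[str, str, date]]) -> list[str]:
    """candidates are (dedup_key, statement, as_of) triples; for each key,
    keep only the statement from the most recent source document."""
    latest: dict[str, tuple[date, str]] = {}
    for key, statement, as_of in candidates:
        if key not in latest or as_of > latest[key][0]:
            latest[key] = (as_of, statement)
    return [statement for _, statement in latest.values()]
```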
Different sections in the newsletter require different policies. Rather than relying on post-processing filters, we encode these rules directly in prompts, making behavior explicit and explainable.
Examples from our pipeline:
Encoding these constraints in prompts reduces cross-section leakage and keeps alignment transparent.
While the current implementation enforces these rules at the prompt level, we are exploring systematic evaluation using an LLM-as-a-judge approach.
In an internal prototype, we tested automated checks such as:
This moves section-specific rules closer to "unit tests" for generative AI systems.
Although not yet part of the production pipeline, early results indicate that these judges can reliably flag violations. Integrating this evaluation into MLflow or other Databricks-native tooling would allow alignment behavior to be measured, tracked, and regression-tested over time.
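Even before an LLM judge is in place, some section policies are mechanically checkable. The sketch below is a rule-based stand-in for such checks; the policies shown are illustrative, mirroring ones described in this post (Macrotrends relying only on PDF-backed evidence, every bullet carrying a citation):

```python
def check_section_policies(sections: dict[str, list[str]]) -> list[str]:
    """Unit-test-style checks over generated sections; returns a list of
    violation messages (empty list means all policies passed)."""
    violations = []
    # Policy 1: Macrotrends may cite only PDF evidence, never use-case context.
    for bullet in sections.get("Macrotrends", []):
        if "[UC:" in bullet:
            violations.append(f"Macrotrends cites use-case context: {bullet}")
    # Policy 2: every bullet in every section must carry a citation.
    for section, bullets in sections.items():
        for bullet in bullets:
            if "[UC:" not in bullet and "[FILE:" not in bullet:
                violations.append(f"{section}: uncited bullet: {bullet}")
    return violations
```

Checks like these can run on every pipeline run, flagging regressions long before a human reviewer sees the output.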
Citations are not a formatting detail. They are a control mechanism.
In this pipeline, every claim must be traceable to a concrete source. If a statement cannot be grounded, it does not appear in the output. This rule is enforced structurally, not retroactively.
All provenance flows through a single structure: doc_index.
doc_index is built during preprocessing and catalogs every source involved in a run, including both PDFs and structured use-case records. Each entry has a stable identifier that is used directly in citations.
Example:
[
  {
    "source_type": "pdf",
    "source_id": "7",
    "file": "architecture-review-notes.pdf"
  },
  {
    "source_type": "use_case",
    "source_id": "Pricing Forecast Modernization",
    "title": "Pricing Forecast Modernization"
  }
]
The LLM is never allowed to invent identifiers. It may only emit citation tags that map directly to entries in doc_index.
The model is constrained to exactly two citation formats:
PDF evidence
[FILE: <name>][PAGE <X>] (OCR allowed)
Structured context
[UC: <usecase_name>]
Every sentence or bullet must end with exactly one citation.
If a claim requires multiple sources, it must be split into multiple bullets.
Citation enforcement starts before synthesis.
Map outputs must include inline citations. Reduce drops any uncited candidates before consolidation:
bullets = [b for b in bullets if has_cite_strict(b)]
This prevents hallucinated statements from propagating downstream.
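One possible implementation of `has_cite_strict`, assuming the "exactly one citation tag, at the end of the bullet" rule described earlier (the actual implementation may differ):

```python
import re

# The two allowed citation formats, anchored to the end of the bullet.
_CITE_END = re.compile(
    r"(\[UC: [^\]]+\]|\[FILE: [^\]]+\]\[PAGE \d+\])\s*$"
)

def has_cite_strict(bullet: str) -> bool:
    """True only if the bullet ends with exactly one allowed citation tag
    and carries no additional tags earlier in the text."""
    match = _CITE_END.search(bullet)
    if not match:
        return False
    body = bullet[: match.start()]
    return "[UC:" not in body and "[FILE:" not in body
```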
References are rendered directly from doc_index, not reconstructed from generated text.
At the end of a run, the system produces a ReferenceSources section listing all PDFs and use cases involved. This works consistently for PDF-only, use-case-only, and mixed runs.
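Rendering from doc_index can be as simple as iterating over its entries; the function below is an illustrative sketch, not the production renderer:

```python
def render_references(doc_index: list[dict]) -> str:
    """Render the ReferenceSources section directly from doc_index,
    so references never depend on generated text."""
    pdfs = [d["file"] for d in doc_index if d["source_type"] == "pdf"]
    ucs = [d["source_id"] for d in doc_index if d["source_type"] == "use_case"]
    lines = ["ReferenceSources"]
    if pdfs:
        lines.append("Documents:")
        lines += [f"- {name}" for name in sorted(set(pdfs))]
    if ucs:
        lines.append("Use Cases:")
        lines += [f"- {name}" for name in sorted(set(ucs))]
    return "\n".join(lines)
```

Because the list is derived from the same structure the citations point into, it stays correct for PDF-only, use-case-only, and mixed runs with no extra logic.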
By making lineage structural—through data models, allowed formats, and early filtering—trustworthiness becomes a property of the pipeline, not the model.
Once lineage and enforcement are structural, the next practical constraint becomes unavoidable: token and context budgets.
Token limits are not merely a scaling concern — they directly impact correctness.
Most LLM failures in multi-source systems come from trying to fit too much into a single prompt. Important evidence gets dropped, context crowds out signal, or the model silently ignores parts of the input. The pipeline avoids this by treating evidence and context differently and by bounding each stage explicitly.
Unstructured evidence, such as PDFs or notes, is chunked early. Each chunk is processed independently in the Map stage, with no awareness of other chunks. This keeps prompts small and predictable, and prevents any single document from dominating the output.
Structured context, such as use-case records, is handled differently. It is summarized once, capped to a fixed size, and reused across Map and Reduce. This ensures that current state and ownership information is always present, without being drowned out by raw text.
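A minimal sketch of such a hard cap, assuming per-use-case summaries have already been produced (the function name is hypothetical):

```python
def cap_context(use_case_summaries: list[str], max_chars: int = 5000) -> str:
    """Build the shared use-case context block, hard-capped so structured
    context can never crowd evidence out of a prompt."""
    out, used = [], 0
    for summary in use_case_summaries:
        if used + len(summary) + 1 > max_chars:
            break  # cap reached: remaining summaries are excluded, not truncated mid-text
        out.append(summary)
        used += len(summary) + 1
    return "\n".join(out)
```

Because the cap is enforced in code, context size is a pipeline invariant rather than a property of any particular prompt.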
In practice, this creates a clear split: evidence is distributed across many small prompts, while context is shared and controlled.
Each stage also has explicit limits and a clear responsibility.
These constraints are enforced in code, not left to prompt wording alone. This makes boundedness and structure properties of the pipeline itself, rather than emergent behavior of the model.
A simplified version of this pattern looks like this:
# Evidence layer: chunked and distributed
# Purpose: bound raw source material into independently processable units
chunks = chunk_text(pdf_text, max_chars=12000)

# Context layer: summarized once and reused
# Purpose: provide stable, shared context without repeated token cost
uc_context = summarize_use_cases(use_cases, max_chars=5000)

# MAP: extract atomic insights from each chunk
# Purpose: local reasoning, no cross-chunk aggregation
map_outputs = [
    llm_map(chunk, uc_context)
    for chunk in chunks
]

# REDUCE + ALIGN: aggregate, deduplicate, and format per section
# Purpose: global reasoning under strict section-level constraints
final_sections = {}
for section in SECTION_KEYS:
    candidates = collect_candidates(map_outputs, section)
    final_text = llm_reduce_align(section, candidates, uc_context)
    final_sections[section] = final_text
Not all sections receive the same inputs. Executive summaries prioritize high-level signals. Risk and blocker sections tolerate uncertainty but require grounding. Macrotrends intentionally ignore use-case context and rely only on PDF-backed evidence. Scoping inputs this way reduces token waste and prevents cross-section leakage.
The result is a system that behaves predictably as inputs grow. When something is missing, it is usually because it was filtered, deduplicated, or out of scope — not because the model guessed wrong.
Across all stages, the goal is the same: reduce the model’s need to guess. By bounding inputs and enforcing structure, the pipeline stays stable, explainable, and suitable for decision-facing content.
The main lesson from this system is simple: reliable LLM output comes from structure, not clever prompts.
Across the pipeline, the model is never asked to do everything at once. Evidence is separated from context. Extraction is separated from synthesis. Synthesis is separated from enforcement. Each step narrows the model’s responsibility and reduces the need to guess.
The newsletter example is just one instance of this pattern. The same architecture applies to any decision-facing content generated from multiple sources: executive updates, operational reviews, risk summaries, or internal status reports. Wherever correctness, provenance, and repeatability matter, structure becomes the foundation.
What makes this work is not the choice of model or the exact wording of prompts. It is the combination of:
This shifts the role of the LLM. Instead of acting as a free-form writer, it becomes a controlled reasoning component inside a larger system.
Structure alone is not enough. Once a system behaves correctly, the next challenge is ensuring it continues to behave correctly as models, prompts, and data evolve.
In the next part of this series, we will focus on:
Moving from prototype to production requires treating generative pipelines as software systems: versioned, evaluated, deployable, and observable.
Better prompts help.
But durable AI systems require architecture, evaluation, and disciplined production practices.