Technical Blog
Explore in-depth articles, tutorials, and insights on data analytics and machine learning in the Databricks Technical Blog. Stay updated on industry trends, best practices, and advanced techniques.
gxzhao-db
Databricks Employee

In Part 1 of this series, we focused on parsing enterprise PDFs. In this post, we build on that work and show how those chunks are combined with structured use-case data to produce consistent, account-ready outputs using LLMs.

Large language models (LLMs) are increasingly used to generate decision-making content from a mix of unstructured documents and structured system data. In practice, this often breaks down once correctness, groundedness, and repeatability start to matter.

This post describes a production-oriented pattern for building reliable LLM pipelines across multiple sources of truth. While the concrete example is an account newsletter, the architecture generalizes to any LLM-driven synthesis where evidence, context, and output quality must be controlled.

The focus is not on model choice or prompt tricks, but on system structure: how inputs are separated, how reasoning is staged, and how constraints are enforced end-to-end.

 

1. What “Useful” Means for LLM-Generated Content

Before talking about architecture, it’s worth defining what “useful” means for LLM-generated content.

In most enterprise environments, data is not scarce. Teams already have PDFs from workshops and reviews, CRM or Salesforce use cases, dashboards, and many ad-hoc notes. The problem is not collecting more information. It is turning scattered, noisy inputs into a short, reliable view of what is happening and what needs attention.

The AI-generated account newsletter used in this series is one concrete example of this broader problem. It represents a common LLM workload: combining multiple sources into a structured summary that highlights priorities, risks, and actions. The same pattern shows up in executive updates, operational reviews, and internal status reports.

For this class of use cases, a useful LLM output must meet three criteria.

Actionable
It should surface what matters now: key initiatives, risks, blockers, and next steps.

Trustworthy
Every statement must be traceable to a real source, such as a specific PDF page or a structured record. If something cannot be grounded, it should not be included.

Consistent
The same inputs should produce similar outputs across runs. Results should not depend on luck with prompts or manual cleanup. 

These constraints shaped the pipeline design. Instead of asking an LLM to “summarize everything,” the system enforces structure, prioritization, and citation rules. The goal is not fluent text, but a reliable signal — content that can be read quickly and trusted.

This design aligns with emerging prompt-optimization approaches such as DSPy and MLflow’s Prompt Optimizer, which treat prompts and LLM programs as objects that can be evaluated, tuned, and versioned. Rather than relying on trial-and-error prompting, the system is built to be measurable and reproducible.

 

2. Joining PDF Evidence with Structured Use-Case Data

Enterprise documents (typically in the form of PDFs) are a rich source of information. Workshop decks, QBR slides, architecture reviews, and meeting notes often contain important decisions, risks, and technical details. In Part 1, we focused on turning these documents into structured, citeable chunks that an LLM can safely consume.

However, PDFs alone are not enough to generate useful, decision-facing content.

First, PDFs are inherently backward-looking. They capture what was discussed at a point in time, but they do not reliably reflect current status. A slide stating “target go-live in Q3” may already be outdated, delayed, or completed. Without additional context, an LLM has no way to know whether a statement is still valid.

Second, PDFs lack operational signals. They rarely contain information such as current stage, ownership, active blockers, or recent comments from account teams. These details often live in structured systems like Salesforce, tracking tools, or curated use-case records, not in presentation decks.

Third, PDFs tend to mix signal and noise. A single document may include background material, historical context, exploratory ideas, and tentative plans alongside actual decisions. Even with good chunking, an LLM cannot reliably infer which parts represent commitments versus discussion without external guidance.

This is where structured context becomes critical.

In our case, structured context is represented by use-case records coming from an ETL pipeline out of Salesforce. These records provide a different kind of information: current stage, business priority, ownership, timelines, and recent updates. They act as a source of truth for what is supposed to be happening now, while PDFs provide evidence for how and why decisions were made.

In the newsletter pipeline, unstructured PDFs are treated as evidence, not authority. Use-case data is treated as context that frames interpretation. When the two agree, PDFs add detail and credibility. When they conflict, the structured context takes priority, and the discrepancy itself may be surfaced as a risk or action item.

This pattern generalizes to most production AI systems. Structured data — such as system-of-record fields, timestamps, ownership, and status — should define what is true, while unstructured data provides why it matters. Treating unstructured inputs as authoritative leads to stale conclusions and subtle hallucinations. Treating them as supporting evidence preserves correctness while still capturing nuance and narrative depth.

This separation is intentional. Asking an LLM to reconcile stale documents and current reality on its own leads to hallucination or false confidence. By explicitly combining unstructured evidence with structured context, the system constrains the model to generate outputs that reflect both historical grounding and current state.

The rest of the pipeline builds on this idea: keep evidence and context separate, and only merge them at controlled points in the generation flow.

 

3. The Map → Reduce → Align Pattern

To produce LLM output that is both useful and trustworthy, the pipeline follows a Map → Reduce → Align pattern.

This structure borrows ideas from classic distributed systems, but applies them to reasoning control, not compute scale. Instead of letting the model reason freely over large inputs, each stage has a narrow, explicit responsibility.

At a high level:

  • Map extracts atomic, cited candidates from small inputs
  • Reduce consolidates those candidates per section
  • Align enforces correctness, citations, and format rules

This turns the LLM from a free-form writer into a constrained reasoning engine.

Architecture at a glance

 

[Architecture diagram: chunked evidence flowing through Map, section-level consolidation in Reduce, and rule enforcement in Align]

 

3.1 Map: Extract Atomic, Cited Candidates

The Map stage operates on small, isolated inputs:

  • A few pages of a PDF
  • A single meeting note
  • A bounded slice of use-case context

Each input is processed independently by a structured LLM call.

The objective is not to summarize the document.
The objective is to extract atomic, source-bound statements that may be useful downstream.

Each chunk is sent to the model with:

  • A constrained prompt
  • A predefined JSON schema (e.g., Blockers, Next Steps, Risks, Initiatives)
  • Strict citation requirements

The model must return structured JSON — not narrative prose.

Example Map Output

Below is a simplified excerpt of actual Map-stage output before reduction:

{
  "Next Steps": [
    "RFI will be sent for Agent Governance [UC: EAI – Lighthouse Pricing Forecast]",
    "Continue onboarding critical use cases and enhance features [FILE: Q1FY26_Strategy_QBR_XYZ.pdf][PAGE 10]",
    "Address technical blockers and ensure legal terms for LLM services are resolved [UC: EAI – Lighthouse Pricing Forecast]"
  ],
  "Blockers": []
}

Each entry:

  • Is self-contained
  • Includes inline provenance
  • Originates from a single chunk
  • Has not yet been deduplicated or prioritized

This is raw signal — not a summary.

Key Properties of Map Output

  • Each item stands alone
  • Each item includes an inline citation
  • No cross-chunk reasoning occurs
  • Output is structured (JSON), not narrative

Because provenance is attached at creation time, the model never needs to “remember” where a statement came from. Source binding is enforced early in the pipeline rather than reconstructed later.

This design dramatically reduces hallucination risk. Map is constrained to extraction, not interpretation. Prioritization and synthesis are intentionally deferred to the Reduce stage.

3.2 Reduce: section-aware consolidation

Once all Map outputs are collected, candidates are grouped by section intent, such as:

SECTION_KEYS = [
    "Executive Summary",
    "Value Proposition",
    "Risk Assessment",
    "Macrotrends",
    "Next Steps",
]

Each section receives only the candidates relevant to its purpose.
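A minimal sketch of this grouping step, matching the collect_candidates call used later in the pipeline code (the dict-of-lists shape mirrors the Map output format; the exact data shapes in the real pipeline may differ):

```python
def collect_candidates(map_outputs, section):
    """Gather all Map-stage candidates for one section, preserving order."""
    candidates = []
    for output in map_outputs:
        # Each Map output is a dict keyed by section name; a missing key
        # means that chunk produced no candidates for the section.
        candidates.extend(output.get(section, []))
    return candidates

# Illustrative Map outputs from two independent chunks
map_outputs = [
    {"Next Steps": ["Send RFI [UC: Pricing Forecast]"], "Risk Assessment": []},
    {"Next Steps": ["Resolve legal terms [UC: Pricing Forecast]"]},
]
```

Because grouping is plain data plumbing rather than an LLM call, each section's Reduce prompt sees only its own candidates and nothing else.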

The Reduce step allows the LLM to:

  • Deduplicate similar items
  • Merge closely related statements
  • Prioritize clearer or more recent signals

Reduce is not allowed to invent new facts. It can only rephrase, combine, or de-emphasize existing Map outputs.

Different sections apply different reduction rules:

  • Executive Summary favors momentum and priorities
  • Risk Assessment favors blockers and uncertainty
  • Macrotrends may ignore use cases and rely only on PDF-backed evidence

These behaviors are enforced through section-specific prompts, not post-processing.

3.3 Align: enforce truth, citations, and format

The Align step is what makes the output safe to share.

Here, hard guardrails are applied:

  • Every sentence must include exactly one allowed citation
  • Citations must match approved formats (PDF page or use case)
  • Section-specific format rules are enforced (bullets vs paragraphs, length limits)
  • Stale or conflicting statements are softened, flagged, or dropped

For example:

  • Macrotrends cannot reference internal use cases
  • Risk items older than a defined window must be rewritten or excluded
  • Output must not include analysis or reasoning artifacts

If Reduce decides what to say, Align decides what is allowed to be said.

In practice, this eliminates:

  • Uncited claims
  • Reasoning leakage
  • Accidental cross-section contamination

 

4. Prompt Design: System vs Map vs Reduce/Align

Prompt engineering is not just wording. In production, prompts act like APIs: they define contracts, scope, and allowed outputs.

In this pipeline we use three prompt layers. Each layer has one job.

In practice, each Map, Reduce, and Align step is executed via Databricks Model Serving or an internal LLM endpoint; the architectural pattern is independent of the specific invocation mechanism.

4.1 System prompt: global constraints (applies everywhere)

The system prompt sets non-negotiables that are shared across stages:

  • use only the provided corpus (PDFs + use cases)
  • do not hallucinate or use outside knowledge
  • citations are mandatory and must be inline
  • keep language concise and executive-readable

Example (simplified from our implementation):

You are an enterprise analyst assisting an account team.

Use ONLY the provided corpus (PDF docs + Use Case notes).
Do NOT use outside knowledge. Do NOT hallucinate.

CITATIONS (MANDATORY)
- Every sentence/bullet must end with exactly one citation tag:
  [UC: <usecase_name>] OR [FILE: <name>][PAGE <X>] (OCR allowed)
- If you need two sources, split into two bullets.
- Do NOT combine sources in one sentence.

This prompt is intentionally boring. It exists to prevent drift.

4.2 Map prompt: extract atomic candidates from one chunk

Map runs on a single bounded chunk of evidence (typically a few pages of a PDF, or a small slice of text). The goal is not to summarize. The goal is to extract small, cited candidate statements that can be reduced later.

Map outputs JSON keyed by section, so downstream stages are deterministic:

  • each key maps to 0–3 short strings
  • each string must include exactly one citation inline
  • no extra keys allowed

In our implementation, we also change the allowed citation templates depending on whether the chunk is file-backed:

  • If the chunk comes from a PDF page, [FILE: …][PAGE …] is allowed
  • If there are no PDFs in the run, we force [UC: …] citations only
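The gating logic above can be sketched in a few lines; the function name and boolean flag here are illustrative, not the actual implementation:

```python
def allowed_citation_templates(run_has_pdfs: bool) -> list:
    """Return the citation formats the Map prompt may use for this run."""
    # Use-case citations are always permitted.
    templates = ["[UC: <usecase_name>]"]
    if run_has_pdfs:
        # File-backed chunks may additionally cite a specific PDF page.
        templates.append("[FILE: <name>][PAGE <X>]")
    return templates
```

The returned list is then interpolated into the Map prompt, so the model never sees a citation format it is not allowed to use.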

Example Map prompt structure (simplified):

PDF CHUNK (3/12):
<chunk_text>

USE-CASE SNAPSHOT:
<use_case_context (truncated)>

OUTPUT
- Return a JSON object with exactly these keys: ["Executive Summary", "Risk Assessment", ...]
- Each key maps to an array of 0–3 strings
- Each string must end with exactly ONE citation tag

In practice, this format can be enforced using structured output capabilities at the model-serving layer (e.g., schema-constrained JSON responses). Rather than relying solely on prompt instructions, the serving endpoint validates that responses conform to the expected schema. This ensures deterministic structure and prevents extra keys or malformed outputs from propagating downstream.
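As a rough illustration, the same schema check can also be applied application-side before candidates enter Reduce. This sketch assumes the five section keys listed earlier and is not the serving-layer validator itself:

```python
import json
import re

SECTION_KEYS = ["Executive Summary", "Value Proposition",
                "Risk Assessment", "Macrotrends", "Next Steps"]

# A candidate must end with exactly one approved citation tag.
CITATION_RE = re.compile(
    r"(\[UC: [^\]]+\]|\[FILE: [^\]]+\]\[PAGE [^\]]+\])\s*$"
)

def validate_map_output(raw: str) -> dict:
    """Parse a Map response and reject malformed structure early."""
    data = json.loads(raw)
    if set(data) != set(SECTION_KEYS):
        raise ValueError(f"unexpected keys: {sorted(set(data))}")
    for key, items in data.items():
        if not isinstance(items, list) or len(items) > 3:
            raise ValueError(f"{key}: expected a list of 0-3 strings")
        for item in items:
            if not CITATION_RE.search(item):
                raise ValueError(f"{key}: missing trailing citation: {item!r}")
    return data
```

Failing fast here means a malformed Map response is retried or discarded instead of silently polluting the Reduce stage.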

A key design choice: Map is allowed to be redundant. We prefer “too many cited candidates” over “missing signal.”

4.3 Reduce prompt: consolidate per section (no new facts)

Reduce runs per section, across the candidates collected from all Map calls.

Before calling the model, we pre-filter aggressively:

  • drop uncited candidates
  • dedupe near-duplicates
  • cap list size to control tokens
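A minimal sketch of this pre-filtering, under the assumption that candidates are plain strings carrying inline citation tags (helper names and the near-duplicate heuristic are illustrative):

```python
import re

CITE_RE = re.compile(r"\[(UC|FILE): [^\]]+\]")

def prefilter(candidates, max_items=40):
    """Drop uncited candidates, dedupe near-duplicates, cap list size."""
    seen = set()
    kept = []
    for c in candidates:
        if not CITE_RE.search(c):
            continue  # no citation -> never reaches Reduce
        # Crude near-duplicate key: lowercased text with citations stripped.
        key = CITE_RE.sub("", c).lower().strip()
        if key in seen:
            continue
        seen.add(key)
        kept.append(c)
    return kept[:max_items]
```

Because this runs before any LLM call, the token budget of the Reduce prompt is bounded in code rather than left to the model.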

Then Reduce asks the LLM to consolidate:

  • deduplicate and merge closely related points
  • keep only the strongest items
  • preserve citations
  • do not invent new claims

Reduce produces the final section text (summary paragraph + bullets), but still under strict constraints.

Time and version awareness

For document-heavy sources like QBRs or product manuals, we treat recency and version as first-class signals during pre-filtering. When multiple candidates express the same idea, we prefer statements from the latest document version (or the most recent “as-of” date) and drop older variants unless they add unique context.

Example: if a QBR from Oct 2025 and a QBR from Jan 2026 both mention the same migration risk, we keep the newer cited statement (Jan 2026) and discard the older one unless it contains details not present in the latest version. This prevents the final section from reflecting outdated status while still preserving traceability.
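One way to sketch this recency preference, assuming each candidate has already been tagged with an as-of date and a coarse topic key (both derived from document metadata in the real pipeline; the tuple shape is illustrative):

```python
from datetime import date

def keep_latest(candidates):
    """For candidates expressing the same idea, keep the most recent one.

    Each candidate is assumed to be (text, as_of_date, topic_key).
    """
    latest = {}
    for text, as_of, topic in candidates:
        if topic not in latest or as_of > latest[topic][1]:
            latest[topic] = (text, as_of)
    return [text for text, _ in latest.values()]

# The QBR example from the text, as hypothetical tagged candidates
candidates = [
    ("Migration risk open [FILE: QBR_Oct.pdf][PAGE 4]",
     date(2025, 10, 1), "migration-risk"),
    ("Migration risk mitigated [FILE: QBR_Jan.pdf][PAGE 2]",
     date(2026, 1, 15), "migration-risk"),
]
```

In practice the older variant would only survive this filter if it were assigned a distinct topic key, i.e., if it carried detail absent from the newer document.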

4.4 Align Behavior: Enforcing Section-Specific Rules

Different sections in the newsletter require different policies. Rather than relying on post-processing filters, we encode these rules directly in prompts, making behavior explicit and explainable.

Examples from our pipeline:

  • Macrotrends
    • PDF-only citations
    • No use-case references

  • Next Steps
    • Add a “Go-Live Readiness Actions” sub-list
    • Only when supported by cited evidence

  • Freshness
    • Rewrite stale evidence as time-bounded (for example, “As of <date> …”)

Encoding these constraints in prompts reduces cross-section leakage and keeps alignment transparent.

Looking Ahead: Evaluating Alignment with LLM-as-a-Judge

While the current implementation enforces these rules at the prompt level, we are exploring systematic evaluation using an LLM-as-a-judge approach.

In an internal prototype, we tested automated checks such as:

  • Detecting use-case leakage into the Macrotrends section
  • Verifying that citations originate from the correct source type
  • Ensuring stale evidence is rewritten with explicit time boundaries
  • Confirming that Go-Live actions appear only when supported by sources

This moves section-specific rules closer to unit tests for generative AI systems.

Although not yet part of the production pipeline, early results indicate that these judges can reliably flag violations. Integrating this evaluation into MLflow or other Databricks-native tooling would allow alignment behavior to be measured, tracked, and regression-tested over time.

 

5. Citation Enforcement and Provenance

Citations are not a formatting detail. They are a control mechanism.

In this pipeline, every claim must be traceable to a concrete source. If a statement cannot be grounded, it does not appear in the output. This rule is enforced structurally, not retroactively.

5.1 doc_index as the source of truth

All provenance flows through a single structure: doc_index.

doc_index is built during preprocessing and catalogs every source involved in a run, including both PDFs and structured use-case records. Each entry has a stable identifier that is used directly in citations.

Example:

[
  {
    "source_type": "pdf",
    "source_id": "7",
    "file": "architecture-review-notes.pdf"
  },
  {
    "source_type": "use_case",
    "source_id": "Pricing Forecast Modernization",
    "title": "Pricing Forecast Modernization"
  }
]

 The LLM is never allowed to invent identifiers. It may only emit citation tags that map directly to entries in doc_index.

5.2 Allowed citation formats

The model is constrained to exactly two citation formats:

PDF evidence

[FILE: <name>][PAGE <X>]   (OCR allowed)

Structured context

[UC: <usecase_name>]

Every sentence or bullet must end with exactly one citation.
If a claim requires multiple sources, it must be split into multiple bullets.

5.3 Enforcing citations early

Citation enforcement starts before synthesis.

Map outputs must include inline citations. Reduce drops any uncited candidates before consolidation:

bullets = [b for b in bullets if has_cite_strict(b)]

This prevents hallucinated statements from propagating downstream.
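The exact implementation of has_cite_strict is not shown in this post, but a plausible sketch using the two approved citation formats looks like this:

```python
import re

# One citation tag, at the very end of the bullet, in an approved format.
STRICT_CITE_RE = re.compile(
    r"(\[UC: [^\]]+\]|\[FILE: [^\]]+\]\[PAGE \d+\])\s*$"
)

# Any citation tag, anywhere in the bullet (used to count sources).
ANY_CITE_RE = re.compile(r"\[(?:UC|FILE): [^\]]+\](?:\[PAGE \d+\])?")

def has_cite_strict(bullet: str) -> bool:
    """True only if the bullet ends with exactly one approved citation."""
    if not STRICT_CITE_RE.search(bullet):
        return False
    # Reject bullets that combine multiple sources in one sentence.
    return len(ANY_CITE_RE.findall(bullet)) == 1
```

Counting tags as well as checking the trailing position enforces the "one citation per bullet, split if you need two sources" rule mechanically.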

5.4 Rendering references

References are rendered directly from doc_index, not reconstructed from generated text.

At the end of a run, the system produces a ReferenceSources section listing all PDFs and use cases involved. This works consistently for PDF-only, use-case-only, and mixed runs.
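A minimal sketch of rendering that section directly from doc_index (entry fields follow the example in 5.1; the output formatting is illustrative):

```python
def render_references(doc_index):
    """Build the ReferenceSources section from doc_index, not from text."""
    lines = ["ReferenceSources"]
    for entry in doc_index:
        if entry["source_type"] == "pdf":
            lines.append(f"- PDF: {entry['file']}")
        else:
            lines.append(f"- Use Case: {entry['source_id']}")
    return "\n".join(lines)

# The doc_index example from section 5.1
doc_index = [
    {"source_type": "pdf", "source_id": "7",
     "file": "architecture-review-notes.pdf"},
    {"source_type": "use_case",
     "source_id": "Pricing Forecast Modernization",
     "title": "Pricing Forecast Modernization"},
]
```

Since the list is derived from the same structure the citations are validated against, the references section can never drift out of sync with the generated text.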

5.5 Why this matters

By making lineage structural—through data models, allowed formats, and early filtering—trustworthiness becomes a property of the pipeline, not the model.

Once lineage and enforcement are structural, the next practical constraint becomes unavoidable: token and context budgets.

 

6. Token and Context Budget Management

Token limits are not merely a scaling concern — they directly impact correctness.

Most LLM failures in multi-source systems come from trying to fit too much into a single prompt. Important evidence gets dropped, context crowds out signal, or the model silently ignores parts of the input. The pipeline avoids this by treating evidence and context differently and by bounding each stage explicitly.

Unstructured evidence, such as PDFs or notes, is chunked early. Each chunk is processed independently in the Map stage, with no awareness of other chunks. This keeps prompts small and predictable, and prevents any single document from dominating the output.

Structured context, such as use-case records, is handled differently. It is summarized once, capped to a fixed size, and reused across Map and Reduce. This ensures that current state and ownership information is always present, without being drowned out by raw text.

In practice, this creates a clear split: evidence is distributed across many small prompts, while context is shared and controlled.

Each stage also has explicit limits and a clear responsibility.

  • MAP extracts short, atomic insights from both structured and unstructured inputs. Each map call operates on a bounded chunk and emits candidates small enough to be independently evaluated.
  • REDUCE aggregates and deduplicates those candidates across all chunks, ensuring the model reasons over a constrained, already-filtered set rather than raw source material.
  • ALIGN enforces strict length, format, and semantic rules per output section, turning aggregated insights into publishable, section-specific results.

These constraints are enforced in code, not left to prompt wording alone. This makes boundedness and structure properties of the pipeline itself, rather than emergent behavior of the model.

A simplified version of this pattern looks like this: 

# Evidence layer: chunked and distributed
# Purpose: bound raw source material into independently processable units
chunks = chunk_text(pdf_text, max_chars=12000)

# Context layer: summarized once and reused
# Purpose: provide stable, shared context without repeated token cost
uc_context = summarize_use_cases(use_cases, max_chars=5000)

# MAP: extract atomic insights from each chunk
# Purpose: local reasoning, no cross-chunk aggregation
map_outputs = [
    llm_map(chunk, uc_context)
    for chunk in chunks
]

# REDUCE + ALIGN: aggregate, deduplicate, and format per section
# Purpose: global reasoning under strict section-level constraints
final_sections = {}

for section in SECTION_KEYS:
    candidates = collect_candidates(map_outputs, section)
    final_text = llm_reduce_align(section, candidates, uc_context)
    final_sections[section] = final_text

Not all sections receive the same inputs. Executive summaries prioritize high-level signals. Risk and blocker sections tolerate uncertainty but require grounding. Macrotrends intentionally ignore use-case context and rely only on PDF-backed evidence. Scoping inputs this way reduces token waste and prevents cross-section leakage.

The result is a system that behaves predictably as inputs grow. When something is missing, it is usually because it was filtered, deduplicated, or out of scope — not because the model guessed wrong.

Across all stages, the goal is the same: reduce the model’s need to guess. By bounding inputs and enforcing structure, the pipeline stays stable, explainable, and suitable for decision-facing content.

 

7. Putting It Together: From Structure to Production Systems

The main lesson from this system is simple: reliable LLM output comes from structure, not clever prompts.

Across the pipeline, the model is never asked to do everything at once. Evidence is separated from context. Extraction is separated from synthesis. Synthesis is separated from enforcement. Each step narrows the model’s responsibility and reduces the need to guess.

The newsletter example is just one instance of this pattern. The same architecture applies to any decision-facing content generated from multiple sources: executive updates, operational reviews, risk summaries, or internal status reports. Wherever correctness, provenance, and repeatability matter, structure becomes the foundation.

What makes this work is not the choice of model or the exact wording of prompts. It is the combination of:

  • explicit stages (Map → Reduce → Align)
  • bounded inputs and outputs
  • structural citation enforcement
  • clear separation between evidence and current context

This shifts the role of the LLM. Instead of acting as a free-form writer, it becomes a controlled reasoning component inside a larger system.

Toward Evaluation and Production Readiness

Structure alone is not enough. Once a system behaves correctly, the next challenge is ensuring it continues to behave correctly as models, prompts, and data evolve.

In the next part of this series, we will focus on:

  • Evaluating alignment behavior using LLM-as-a-judge
  • Tracking regression and drift with MLflow-based evaluation
  • Packaging the pipeline using Databricks Asset Bundles (DAB)
  • Deploying the newsletter as a Databricks App
  • Operating the system at scale with controlled deployment workflows

Moving from prototype to production requires treating generative pipelines as software systems: versioned, evaluated, deployable, and observable.

Better prompts help.

But durable AI systems require architecture, evaluation, and disciplined production practices.