Enterprise account macro trends, strategy documents, and account and project updates often end up in PDF format. Meanwhile, usage metrics and account-level signals—such as active users, DBU consumption, and use case stages—live in Delta tables sourced from systems like Salesforce. When teams need to communicate the health of an initiative or customer account, they typically combine both: structured use case data and unstructured context from documents. This workflow is valuable but difficult to scale: a repeatable pipeline must pull structured data, parse PDFs, normalize their content, and combine everything before any AI model can summarize it reliably. This blog introduces a reusable pattern for doing this on Databricks.
The processing pattern looks like this:
In this series, the AI newsletter is powered by two sources: a structured use case execution summary table built from Salesforce, and unstructured PDFs uploaded by the account team. This article focuses on the PDF side of the pipeline, but the newsletter is always driven by both.
PDFs are the most common enterprise document format, but the least model-friendly. They mix different layouts and content types: text blocks, tables, charts, vector text, and scanned pages. Some pages require OCR, while others contain repeated headers or multi-column formatting that disrupts summarization accuracy. To use PDFs in an AI pipeline, they must first be transformed into a consistent, section-level structure.
In real workloads, an account or project may include dozens of PDFs. The parsing stage normalizes all of them into a consistent structure so they can be combined and summarized together.
On Databricks, a parsing stage can normalize PDFs by extracting digital text, applying OCR where needed, removing repeated headers and footers, and segmenting content into sections.
This produces an AI-ready representation stored as JSON artifacts in Volumes. After summarization, only the final curated elements are written to Delta. The pipeline reuses the parsed artifacts without re-processing PDFs, enabling scalable summarization.
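As a concrete illustration, each parsed section can be represented as a small JSON record that carries its own provenance (the field names here are assumed for illustration, not the pipeline's exact schema):

```python
import json

# Illustrative schema: one parsed PDF section, normalized so that every
# document contributes the same fields to downstream summarization.
section = {
    "file": "Strategy_Update_Q3.pdf",   # source document
    "page": 2,                          # page the text came from
    "extraction": "digital",            # "digital" or "ocr"
    "section_title": "Cost Optimization",
    "text": "New cost-optimization initiative targets operational savings...",
}

# A document's artifact is a JSON list of such sections, stored in a Volume.
artifact = json.dumps([section], indent=2)
```

Because every section records its file and page, later stages can emit the `[FILE: …][PAGE …]` citations shown below without re-opening the PDFs.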
Once PDFs are normalized into structured chunks, they can be combined with metrics and summarized into meaningful categories. For example:
{
  "Executive Summary": [
    "New cost-optimization initiative targets operational savings over the next fiscal year [FILE: Strategy_Update_Q3.pdf][PAGE 2]"
  ],
  "Value Proposition": [
    "Unified analytics and governance platform reduces integration overhead [FILE: Architecture_Planning_Report.pdf][PAGE 5]"
  ],
  "Competitive Insights": [
    "Customer continues evaluating multiple analytics vendors for migration [FILE: Vendor_Compliance_Notes.pdf][PAGE 4]"
  ],
  "Risk Assessment": [
    "Several onboarding tasks remain pending and may affect deployment milestones [FILE: Platform_Readiness_Checklist.pdf][PAGE 3]"
  ],
  "Next Steps": [
    "Finalize architecture review and confirm migration scope [FILE: Roadmap_Alignment_Meeting.pdf][PAGE 7]"
  ]
  …
}
These summaries are produced by the summarization stage that runs after parsing; that stage is the focus of Blog 2 in this series.
The newsletter pipeline uses a two-stage process that separates parsing from summarization. This keeps the workflow modular, scalable, and easier to maintain. In this stage, PDFs are only normalized and stored in a governed format, not summarized.
To reliably extract text from both digital and scanned PDFs, the parsing job uses a combination of Python libraries:
| Task | Tool |
| --- | --- |
| Text extraction for digital PDFs | `pdfplumber` |
| Extracting page images and layout | `pypdfium2` |
| OCR for scanned content | `easyocr` + `Pillow` |
| Cleaning noisy text | `rapidfuzz` |
This mix allows the parsing job to normalize messy enterprise PDFs into machine-readable chunks that can later be summarized.
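The text-cleaning step can be sketched with fuzzy matching: drop any line that near-matches a known running header, so OCR typos don't defeat an exact-string filter. The snippet below uses the stdlib's `difflib` as a stand-in for `rapidfuzz` (same idea, different library), with an invented header string:

```python
from difflib import SequenceMatcher

def strip_repeated_header(lines, header, threshold=0.9):
    """Drop lines that fuzzily match a known running header.

    The pipeline uses rapidfuzz for this; difflib is a stdlib stand-in here.
    """
    def similar(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return [ln for ln in lines if similar(ln, header) < threshold]

# OCR garbled the header slightly ("Updatee"), but it is still removed.
page = ["ACME Corp - Quarterly Strategy Updatee", "Revenue grew 12% in Q3."]
cleaned = strip_repeated_header(page, "ACME Corp - Quarterly Strategy Update")
```

Fuzzy rather than exact matching matters because OCR rarely reproduces a header identically on every page.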
Volumes store raw user-uploaded PDFs and all parsing artifacts (OCR text, merged candidates, section JSON, prompt packs).
Parsing Job extracts text, applies OCR when needed, normalizes headers/footers, and segments content into reusable sections.
Delta Tables in Unity Catalog store only the final curated newsletter intelligence that downstream tools rely on (app display, metrics, video generation).
Summarization Stage (next blog) reads both parsed artifacts from Volumes and structured Salesforce use case data, then generates final newsletter content stored in Delta.
This separation ensures that parsing runs once per document, while downstream jobs can be rerun on demand without touching the PDFs again.
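The parse-once behavior can be made explicit with a small idempotency check: skip parsing whenever the document's artifact already exists in the Volume. A minimal sketch, where `run_parser` is a hypothetical stand-in for the real parsing job:

```python
import json
from pathlib import Path

def run_parser(pdf_path):
    """Stand-in for the real parsing job (text extraction + OCR + sectioning)."""
    return [{"file": pdf_path, "page": 1, "text": "..."}]

def parse_once(pdf_path, artifact_dir):
    """Parse a PDF only if its artifact is missing; otherwise reuse the artifact."""
    artifact = Path(artifact_dir) / (Path(pdf_path).stem + ".sections.json")
    if artifact.exists():
        # downstream rerun: reuse the stored sections, never touch the PDF
        return json.loads(artifact.read_text())
    sections = run_parser(pdf_path)  # the expensive step runs once per document
    artifact.parent.mkdir(parents=True, exist_ok=True)
    artifact.write_text(json.dumps(sections))
    return sections
```

With this guard, summarization jobs can be rerun freely; only new or changed documents incur parsing cost.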
Alongside PDFs, the newsletter pipeline relies on a structured view of use cases for each account. This is materialized as a use case execution summary table in Unity Catalog, built from Salesforce and GTM datasets.
At a high level, the ETL job combines Salesforce and GTM datasets into one record per use case, capturing its lifecycle stage, estimated impact, and current blockers.
The result is a Delta table (for example, usecase_exec_summary). This table acts as the structured backbone of the newsletter: it tells you what use cases are in flight, where they are in the lifecycle, how much impact they have, and what’s blocking them.
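A simplified sketch of the table definition is shown below. The source tables and column names are illustrative assumptions, not the real Salesforce/GTM schema:

```sql
-- Illustrative only: source tables and columns are assumed, not the real schema.
CREATE OR REPLACE TABLE main.newsletter_demo.usecase_exec_summary AS
SELECT
  u.account_id,
  u.use_case_name,
  u.lifecycle_stage,        -- e.g., discovery, build, production
  u.estimated_impact,       -- sizing signal used for ranking
  u.current_blocker,
  g.segment
FROM salesforce.use_cases AS u
LEFT JOIN gtm.accounts AS g USING (account_id);
```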
When combined with the PDF parsing output, the newsletter can surface:
Blog 2 will show how these two inputs—structured use case summaries and parsed PDF sections—are merged, ranked, cited, and summarized into a single AI-generated newsletter.
Enterprise PDFs are inconsistent: some pages contain extractable digital text, others are scanned documents that require OCR. A reliable pipeline therefore needs a digital-first approach with selective OCR fallback, and it should store the results in a reusable, governed format.
The goal of the parsing stage is not to summarize documents. It is to normalize many PDFs into structured sections so they can be combined, cited, and summarized later.
Account teams frequently upload multiple files. Rather than processing a single document, the pipeline reads a folder of PDFs for each account:
import os

PDF_ROOT = "/Volumes/main/newsletter_demo/uploaded_docs"

def list_pdfs(account_id):
    base = f"{PDF_ROOT}/customer_{account_id}"
    return [f"{base}/{f}" for f in os.listdir(base) if f.lower().endswith(".pdf")]
Idea: Treat an account as a collection of documents, not one file.
To reduce cost and avoid noisy OCR output, the pipeline attempts text extraction first. Only if the text is too sparse does it fall back to OCR:
import numpy as np
import pdfplumber
import pypdfium2 as pdfium

def extract_page(pdf_path, page_i, ocr):
    # try digital text first
    with pdfplumber.open(pdf_path) as pdf:
        text = pdf.pages[page_i].extract_text() or ""
    if len(text.strip()) > 40:  # simple quality threshold
        return text, "digital"  # good digital text detected
    # fallback for scanned pages: render at 2x scale, then OCR the image
    img = pdfium.PdfDocument(pdf_path)[page_i].render(scale=2).to_pil()
    return "\n".join(ocr.readtext(np.array(img), detail=0)), "ocr"
Idea: OCR should be a fallback, not the default.
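Applied across a whole document, the per-page digital-vs-OCR decision also yields the OCR ratio and cost estimates recorded in `stats.json`. A simplified sketch that operates on already-extracted `(text, mode)` pairs (the 4-characters-per-token heuristic is a rough assumption, not the pipeline's tokenizer):

```python
def page_stats(pages):
    """Summarize per-page (text, mode) extraction results for stats.json."""
    ocr_pages = sum(1 for _, mode in pages if mode == "ocr")
    total_chars = sum(len(text) for text, _ in pages)
    return {
        "page_count": len(pages),
        "ocr_ratio": ocr_pages / max(len(pages), 1),
        "est_tokens": total_chars // 4,  # rough 4-chars-per-token heuristic
    }

stats = page_stats([("Digital page text", "digital"), ("OCR page text", "ocr")])
```

A high `ocr_ratio` is a useful quality signal: heavily scanned documents tend to need more aggressive cleanup before summarization.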
After parsing, the pipeline stores normalized text, OCR output, and section metadata as JSON artifacts in a Databricks Volume. These artifacts act as reusable inputs for summarization. Only final newsletter elements are written to Delta tables in Unity Catalog, not the raw parsed sections.
Why This Matters
By storing normalized sections in Volumes, the parsing stage becomes reusable across multiple workflows. After this step, downstream jobs can:
PDFs → Parsed Artifacts (Volumes) → Summarization & Insights → Final Delta Outputs
The parsing stage creates structured data, not summaries.
Summarization is the focus of the next blog in this series.
After parsing, the pipeline produces multiple artifacts:
clean digital text, OCR fallback text, combined files, stats, and model candidate JSON. These are stored in a Databricks Volume, not in a Delta table. They serve as reproducible and inspectable raw AI inputs for the intelligence stage.
Volumes (raw + intermediate artifacts)
├── *.digital_text.txt
├── *.ocr_text.txt
├── combined_text.txt
├── stats.json
├── prompt_pack.json
└── llm_merged_candidates.json
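Persisting this artifact set is plain file I/O, since Volumes are accessible as regular filesystem paths from Databricks jobs. A minimal sketch covering two of the files above (the output directory layout is an assumption):

```python
import json
from pathlib import Path

def write_artifacts(out_dir, combined_text, stats):
    """Persist the reusable parsing artifacts for one document run."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "combined_text.txt").write_text(combined_text)
    (out / "stats.json").write_text(json.dumps(stats, indent=2))

# e.g. write_artifacts("/Volumes/main/newsletter_demo/parsed/customer_42/doc1",
#                      combined_text, stats)
```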
Each parsing run keeps reusable evidence in a UC Volume. The files serve different purposes:
| Artifact | Description | Purpose |
| --- | --- | --- |
| `*.digital_text.txt` | Extracted text from native (non-scanned) PDF pages | Fast baseline parsing without OCR |
| `*.ocr_text.txt` | OCR text extracted from scanned pages or images | Fallback when the document is not machine-readable |
| `combined_text.txt` | A merge of digital + OCR text, after cleanup | The canonical source for LLM tasks |
| `stats.json` | Page count, OCR ratio, estimated chunk count, token cost estimates | Helps tune chunking size + LLM cost |
| `prompt_pack.json` | The exact prompts used for summarization | Enables reproducibility + auditability |
| `llm_merged_candidates.json` (Blog 2) | Section-level responses before deduplication and reduction | Used for map-reduce and debugging hallucinations |
This enables multi-use outputs from the same intelligence:
Delta (Newsletter Elements)
        ↓
├── AI Newsletter App
├── Video Generation Narrative
├── Account Review Dashboards
└── Search + Trend Extraction
Insight: In AI pipelines, Volumes preserve lineage and evidence, while Delta offers final AI intelligence for the business.
A successful AI newsletter doesn’t start with summarization—it starts with structure. By normalizing PDFs into governed JSON and Delta tables, the pipeline becomes reliable, reusable, and auditable. This separation ensures that document intelligence is stored once and can power many downstream products, from newsletters to dashboards to video summaries. Clean structure makes AI predictable, not random.
In the next article, we’ll dive into how the summarization stage works at scale on Databricks. We’ll show how a map–reduce pattern and structured metadata enable accurate, explainable summaries—without prompt guessing.