Technical Blog
Explore in-depth articles, tutorials, and insights on data analytics and machine learning in the Databricks Technical Blog. Stay updated on industry trends, best practices, and advanced techniques.
Giselle_Go_DB
Databricks Employee

From-PDF-to-Insights-Graphic-v2.png

Every organization has critical information trapped in PDFs and unstructured documents: forms, reports, records, filings. Historically, turning those files into usable data has meant manual data entry, brittle OCR scripts, or single-purpose tools that don't scale.

On Databricks, you can treat documents like any other data source. Ingest them reliably, enrich them with AI, productionize the workflow, and expose the results to business users, all in one place.

This post walks through a step-by-step pattern for building a production-grade Intelligent Document Processing (IDP) pipeline using Lakeflow Connect, Lakeflow Spark Declarative Pipelines, Lakeflow Jobs, and Document Intelligence. You can learn more about the broader architecture and strategy in Building with Databricks Document Intelligence and Lakeflow.

Rather than managing a complex web of disconnected tools, this solution uses Lakeflow to create a simple, unified environment for document intelligence. We take raw, unstructured documents from your source and transform them into production-ready insights, all within the Databricks Platform.

You can watch the video walkthrough above, or follow the step-by-step guide below.

Step 1: Select Source Documents

The starting point is a set of PDFs sitting in a SharePoint document library: the kind of unstructured files every organization accumulates, filled out by hand, varying in format, and difficult to process programmatically.

In the companion video embedded above, we use aircraft maintenance log books as a concrete example. However, the same pattern applies to any document type: invoices, claims forms, engineering specs, medical records, inspection reports, or any other PDF your teams produce and store.

These are the documents we’ll be pulling into our pipeline to extract rich insights using Lakeflow and Document Intelligence. What makes this set particularly challenging is their variability, as they contain a mix of handwritten and printed text, tables with inconsistent layouts, and key information buried in different locations across every file. While traditional OCR tools often struggle with this level of complexity, the combination of a managed pipeline and native AI Functions allows us to handle these challenges reliably and at scale.

Step 2: Ingest with Lakeflow Connect

Lakeflow Connect provides two ways to ingest documents from SharePoint: a wizard-driven UI and a SQL API. Both create a fully managed, incremental ingestion pipeline.

Ingestion UI

1.png

From the Databricks data ingestion page, select ‘SharePoint’ as the source. The wizard walks you through five steps:

  1. Connection: Select an existing Unity Catalog connection to your SharePoint environment, or create a new one. You'll need your SharePoint site ID, tenant ID, client ID, and client secret. These credentials are stored securely in Unity Catalog.
  2. Ingestion setup: Name the pipeline and select ‘unstructured data’ as the data type (since you're working with PDFs).
  3. Select source: Provide the URL to your SharePoint document library. Browse and select the files you want to ingest.
  4. Destination: Choose the destination catalog, schema, and table name where the files will land.
  5. Schedule: Set a cadence for the pipeline to check for new or modified files. A daily schedule is typical.

That's the entire setup. No incremental logic scripts, no building connectors from scratch. Lakeflow Connect manages all of it.

Note: The ingestion wizard UI is being rolled out and may not be available in all workspaces yet. If you don’t see it, use the SQL API below.

Ingestion SQL API

You can also define the ingestion pipeline directly in SQL within a Lakeflow Spark Declarative Pipeline. Just provide the URL to your folder of documents and reference an existing SharePoint Unity Catalog connection. If you don't have a connection yet, follow these steps to create one.

CREATE OR REFRESH STREAMING TABLE main.idp_demo.sharepoint_pdfs
AS SELECT *, _metadata FROM STREAM read_files(
  "https://mytenant.sharepoint.com/sites/MySite/Shared%20Documents",
  format => "binaryFile",
  `databricks.connection` => "my_sharepoint_conn",
  pathGlobFilter => "*.pdf"
);

The STREAMING TABLE keyword means the pipeline is incremental: each run only picks up files that are new or modified since the last run. The binaryFile format preserves the raw PDF content as binary data. The _metadata column captures additional file-level information from the source.

The result

2.png

The output is a table where each row represents one document. Key columns:

  • path: the file path from SharePoint
  • modificationTime: when the file was last updated
  • length: file size in bytes
  • content: the raw binary content of the PDF itself

Inspect the ingested table in Catalog Explorer to confirm what landed. You'll see one row per document, with the binary content and metadata described above.
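For example, a quick sanity check in the SQL editor. Table and column names follow the example pipeline above; the `_metadata` fields are the standard Databricks file-metadata columns:

```sql
-- Spot-check the ingested table: one row per PDF, newest first.
SELECT
  path,
  modificationTime,
  length,
  _metadata.file_name AS file_name
FROM main.idp_demo.sharepoint_pdfs
ORDER BY modificationTime DESC
LIMIT 10;
```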

Step 3: Parse and Extract with Document Intelligence

Document intelligence on Databricks is powered by two AI Functions: ai_parse_document (converts raw PDFs into structured content) and ai_extract (pulls out specific fields you define). You can work with these functions in two ways:

  1. Document Intelligence UI: A visual interface for configuring, previewing, and iterating on your extraction. Define your schema in natural language, see results per document, and refine until the output looks right. When you're done, Document Intelligence generates the ai_parse_document and ai_extract SQL for you automatically.
  2. SQL API: Call ai_parse_document and ai_extract directly in SQL for full control over the pipeline logic.

The typical workflow is to start in the UI, get the extraction working, then continue iterating on the generated SQL (for example, to validate programmatically against ground truth data or to customize the pipeline logic).

Document Intelligence (UI)

3.png

  1. Navigate to Agents in the Databricks sidebar and click Create Agent.
  2. Filter by unstructured and select Information Extraction.
  3. Select your ingestion table and choose the content column (the binary PDF data).
  4. Click Create Agent to spin up the extraction environment.
  5. Describe the information you want to extract in natural language (for example, "the date the work was performed, the name of the person who performed it, the total duration, and any relevant identifiers"). Click Generate Schema and the agent converts your description into a structured extraction schema.
  6. Preview results for each document. The agent shows the parsed PDF content alongside the extracted fields. Click through multiple files to verify consistency. Provide feedback to the agent to intelligently auto-update your schema and iterate on quality.

Once satisfied, productionize with a single click (covered in Step 4).

AI Functions (SQL API)

Whether you're working with SQL generated by Document Intelligence or writing it from scratch, the code is straightforward.

Parse documents with ai_parse_document:

CREATE OR REFRESH STREAMING TABLE main.idp_demo.documents_parsed 
TBLPROPERTIES (
  'delta.feature.variantType-preview' = 'supported'
)
AS
SELECT
  path,
  ai_parse_document(content, MAP('version', '2.0')) AS parsed_content
FROM main.idp_demo.sharepoint_pdfs;

ai_parse_document takes the raw binary content of each PDF and uses research-backed techniques to read through the document: handwriting, printed text, tables, figures, mixed layouts. It returns a structured VARIANT output containing the full parsed content, page-level information, and element-level details (text, tables, section headers, figures, captions, bounding boxes, and more). One function call, any document format.
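To get a feel for the parsed output, you can navigate the VARIANT with path syntax. The field paths below are illustrative only; the exact output shape depends on the ai_parse_document version, so verify them against your own parsed rows:

```sql
-- Peek inside the parsed VARIANT (illustrative paths; inspect your
-- own rows to confirm the exact structure for your parser version).
SELECT
  path,
  parsed_content:document:pages    AS pages,
  parsed_content:document:elements AS elements
FROM main.idp_demo.documents_parsed
LIMIT 5;
```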

Extract fields with ai_extract:

CREATE OR REFRESH STREAMING TABLE main.idp_demo.documents_extracted 
TBLPROPERTIES (
  'delta.feature.variantType-preview' = 'supported'
)
AS
SELECT
  path,
  ai_extract(
    parsed_content,
    '{
      "field_1": {"type": "string", "description": "Description of field 1"},
      "field_2": {"type": "string", "description": "Description of field 2"},
      "field_3": {"type": "number", "description": "Description of field 3"},
      "field_4": {"type": "number", "description": "Description of field 4"}
    }'
  ) AS extracted
FROM main.idp_demo.documents_parsed;

ai_extract takes the VARIANT output from ai_parse_document and applies a JSON schema you define. Replace the placeholder fields above with your actual extraction targets (dates, names, durations, identifiers, or any fields relevant to your documents). The function returns a structured VARIANT with a response object containing the extracted values for each document.
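Downstream consumers usually want typed columns rather than a VARIANT. Here is a sketch of flattening the response object, using the placeholder field names from the schema above (substitute your real fields and types):

```sql
-- Flatten the extraction result into typed columns.
SELECT
  path,
  extracted:response:field_1::STRING AS field_1,
  extracted:response:field_2::STRING AS field_2,
  extracted:response:field_3::DOUBLE AS field_3
FROM main.idp_demo.documents_extracted;
```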


Quick exploration

For ad-hoc exploration before productionizing, you can chain both functions in one query:

WITH parsed AS (
  SELECT
    path,
    ai_parse_document(content, MAP('version', '2.0')) AS parsed_content
  FROM main.idp_demo.sharepoint_pdfs
)
SELECT
  path,
  ai_extract(
    parsed_content,
    '{
      "field_1": {"type": "string", "description": "Description of field 1"},
      "field_2": {"type": "string", "description": "Description of field 2"},
      "field_3": {"type": "number", "description": "Description of field 3"}
    }'
  ) AS extracted
FROM parsed;

This is useful for testing your schema against a few documents before committing to a full pipeline.

Step 4: Productionize with a Lakeflow Spark Declarative Pipeline

Whether you used Document Intelligence or wrote the SQL yourself, the next step is automation. This is where Spark Declarative Pipelines (SDP) really shine. Instead of manually writing hundreds of lines of procedural logic, you can focus on writing your business logic in SQL and SDP will automatically convert that into a managed production-grade pipeline. SDP handles the heavy lifting with efficient serverless compute and incremental processing by default while providing built-in observability.

From Document Intelligence

4.png

If you used the Document Intelligence UI in Step 3, click Use Agent and select Lakeflow Spark Declarative Pipeline. Databricks auto-generates a complete pipeline with the SQL already written for you. No additional work needed.

From code

5.png

If you wrote the SQL yourself, those CREATE OR REFRESH STREAMING TABLE statements from Step 3 are already pipeline-ready. Place them in a Lakeflow Spark Declarative Pipeline, and you have a production pipeline.

What the pipeline does

Either way, the pipeline contains two STREAMING TABLEs:

  • Silver table (documents_parsed): Reads from the ingestion streaming table and runs ai_parse_document to convert raw binary PDFs into structured content.
  • Gold table (documents_extracted): Reads from the silver table and runs ai_extract to pull out the specific fields defined in your schema.

Both tables are incremental. Each time the pipeline runs, it checks for new rows in the ingestion table (new PDFs streamed in from SharePoint) and only processes those. Documents that have already been parsed are not re-parsed. Documents that have already been extracted are not re-extracted. This is what makes the pattern cost-effective at scale: you are not re-running AI models over documents you've already processed.

The result is a gold table with clean, structured data extracted from your PDFs, stored as a standard table in Unity Catalog that plugs into anything downstream.
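Because the gold table is a standard Unity Catalog table, downstream queries are plain SQL. A hypothetical rollup over the placeholder fields from Step 3 (replace with your actual extracted columns):

```sql
-- Example downstream rollup on the gold table.
SELECT
  extracted:response:field_1::STRING AS category,
  COUNT(*)                           AS document_count,
  AVG(extracted:response:field_3::DOUBLE) AS avg_value
FROM main.idp_demo.documents_extracted
GROUP BY 1
ORDER BY document_count DESC;
```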

Step 5: Use Cases and Dashboard

The gold table is a governed table with structured data extracted from your unstructured documents. What you build on top of it depends on your use case:

  • Chatbots and assistants (such as Databricks Genie or Databricks Knowledge Assistant): Let teams ask natural language questions about the documents.
  • AI agents or agentic workflows: Automate compliance checks, routing, or classification based on extracted fields.
  • Search: Build search applications over parsed document content.
  • Unified analytics: Combine extracted document data with your existing structured data for cross-source reporting.
  • Dashboards: Visualize extracted metrics for operational teams.

In the embedded video above, we show an AI/BI Dashboard as one example. The dashboard is built on top of the gold table and surfaces visualizations (activity over time, workload distribution, compliance metrics), all powered by data extracted from those original PDFs. Because it's an AI/BI Dashboard, users can ask questions in natural language without needing to know SQL.

The dashboard is just one starting point. The extracted data table is the foundation.

Step 6: Orchestrate with Lakeflow Jobs

The final step is tying everything together into one automated workflow using Lakeflow Jobs.

6.png

Create a job with three tasks chained in sequence:

  1. Ingest: Run the Lakeflow Connect pipeline. It checks SharePoint for new or modified files and incrementally ingests them.
  2. Parse and Extract: Run the Spark Declarative Pipeline. It parses the newly ingested documents and extracts the defined fields.
  3. Refresh Dashboard: Update the AI/BI Dashboard so it reflects the latest data.

Each time this job runs, the entire chain executes end-to-end. If no new files have arrived in SharePoint, the ingestion step completes quickly and nothing downstream re-processes. If ten new PDFs landed, only those ten flow through parsing and extraction.

Set a daily trigger, configure failure notifications so the right people know if something breaks, and that's it. Set it and forget it. No custom orchestration code, no fragile scripts to maintain.
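If you prefer configuration as code, the same job can be sketched as a Databricks Asset Bundle resource. Everything below is a hypothetical sketch: the job name, pipeline references, dashboard ID, and notification email are placeholders for your own resources.

```yaml
# Sketch of the three-task job as an asset bundle resource
# (placeholder names and IDs; adapt to your workspace).
resources:
  jobs:
    idp_daily_job:
      name: idp-daily-job
      trigger:
        periodic:
          interval: 1
          unit: DAYS
      email_notifications:
        on_failure:
          - data-team@example.com
      tasks:
        - task_key: ingest
          pipeline_task:
            pipeline_id: ${resources.pipelines.ingest_pipeline.id}
        - task_key: parse_extract
          depends_on:
            - task_key: ingest
          pipeline_task:
            pipeline_id: ${resources.pipelines.idp_pipeline.id}
        - task_key: refresh_dashboard
          depends_on:
            - task_key: parse_extract
          dashboard_task:
            dashboard_id: <your-dashboard-id>
```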

Conclusion

Let’s review the full end-to-end flow of the solution, which automates everything from ingestion to final insights:

  1. Select unstructured documents from SharePoint as the source.
  2. Ingest using Lakeflow Connect. Simple UI, fully managed, incremental ingestion.
  3. Parse and extract structured fields using AI Functions and Document Intelligence. Interactive preview, no external parsing tools.
  4. Productionize into a Spark Declarative Pipeline. Auto-generated or hand-written, fully incremental at every step.
  5. Build downstream use cases: dashboards, chatbots, agents, search, analytics.
  6. Orchestrate the entire flow with Lakeflow Jobs. One automated workflow, one daily trigger.

No custom ingestion code. No external parsing services. No manual orchestration. The entire pipeline, from PDF to dashboard, runs on Databricks.

To get started, check out the Lakeflow Connect SharePoint documentation, the ai_parse_document reference, and the ai_extract reference. Start with a single document type, prove the pattern, and expand from there.