Every organization has critical information trapped in PDFs and unstructured documents: forms, reports, records, filings. Historically, turning those files into usable data has meant manual data entry, brittle OCR scripts, or single-purpose tools that don't scale.
On Databricks, you can treat documents like any other data source. Ingest them reliably, enrich them with AI, productionize the workflow, and expose the results to business users, all in one place.
This post walks through a step-by-step pattern for building a production-grade Intelligent Document Processing (IDP) pipeline using Lakeflow Connect, Lakeflow Spark Declarative Pipelines, Lakeflow Jobs, and Document Intelligence. You can learn more about the broader architecture and strategy in Building with Databricks Document Intelligence and Lakeflow.
Rather than managing a complex web of disconnected tools, this solution uses Lakeflow to create a simple, unified environment for document intelligence. We take raw, unstructured documents from your source and transform them into production-ready insights, all within the Databricks Platform.
You can watch the video walkthrough above, or follow the step-by-step guide below.
The starting point is a set of PDFs sitting in a SharePoint document library: the kind of unstructured files every organization accumulates, filled out by hand, varying in format, and difficult to process programmatically.
In the companion video embedded above, we use aircraft maintenance log books as a concrete example. However, the same pattern applies to any document type: invoices, claims forms, engineering specs, medical records, inspection reports, or any other PDF your teams produce and store.
These are the documents we’ll be pulling into our pipeline to extract rich insights using Lakeflow and Document Intelligence. What makes these documents particularly challenging is their variability: they contain a mix of handwritten and printed text, tables with inconsistent layouts, and key information buried in different locations across every file. While traditional OCR tools often struggle with this level of complexity, the combination of a managed pipeline and native AI Functions lets us handle these challenges reliably and at scale.
Lakeflow Connect provides two ways to ingest documents from SharePoint: a wizard-driven UI and a SQL API. Both create a fully managed, incremental ingestion pipeline.
From the Databricks data ingestion page, select ‘SharePoint’ as the source. The wizard walks you through the entire setup in five steps.
That's the entire setup. No incremental logic scripts, no building connectors from scratch. Lakeflow Connect manages all of it.
Note: The ingestion wizard UI is being rolled out and may not be available in all workspaces yet. If you don’t see it, use the SQL API below.
You can also define the ingestion pipeline directly in SQL within a Lakeflow Spark Declarative Pipeline. Just provide the URL to your folder of documents and reference an existing SharePoint Unity Catalog connection. If you don't have a connection yet, follow these steps to create one.
CREATE OR REFRESH STREAMING TABLE main.idp_demo.sharepoint_pdfs
AS SELECT *, _metadata FROM STREAM read_files(
"https://mytenant.sharepoint.com/sites/MySite/Shared%20Documents",
format => "binaryFile",
`databricks.connection` => "my_sharepoint_conn",
pathGlobFilter => "*.pdf"
);
The STREAMING TABLE keyword means the pipeline is incremental: each run only picks up files that are new or modified since the last run. The binaryFile format preserves the raw PDF content as binary data. The _metadata column captures additional file-level information from the source.
The output is a table where each row represents one document. Key columns include path (the file’s location in the source), content (the raw PDF bytes), length, modificationTime, and the _metadata struct with additional file-level details.
Inspect the ingested table in Catalog Explorer to confirm what landed. You'll see one row per document, with the binary content and metadata described above.
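For example, a quick sanity check might look like the following (a sketch assuming the table name from the ingestion SQL above; with the binaryFile format, read_files produces path, modificationTime, length, and content columns):

```sql
-- Quick sanity check on the ingested table: newest files first.
-- Column names follow the binaryFile schema produced by read_files.
SELECT
  path,
  length AS size_bytes,
  modificationTime,
  _metadata.file_name
FROM main.idp_demo.sharepoint_pdfs
ORDER BY modificationTime DESC
LIMIT 10;
```

If the row count and modification times match what you expect from SharePoint, the connector is working and you can move on to parsing.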
Document intelligence on Databricks is powered by two AI Functions: ai_parse_document (converts raw PDFs into structured content) and ai_extract (pulls out specific fields you define). You can work with these functions in two ways: through the Document Intelligence UI, or by writing the SQL directly yourself.
The typical workflow is to start in the UI, get the extraction working, then continue iterating on the generated SQL (for example, to validate programmatically against ground truth data or to customize the pipeline logic).
Once satisfied, productionize with a single click (covered in Step 4).
Whether you're working with SQL generated by Document Intelligence or writing it from scratch, the code is straightforward.
Parse documents with ai_parse_document:
CREATE OR REFRESH STREAMING TABLE main.idp_demo.documents_parsed
TBLPROPERTIES (
'delta.feature.variantType-preview' = 'supported'
)
AS
SELECT
path,
ai_parse_document(content, MAP('version', '2.0')) AS parsed_content
FROM main.idp_demo.sharepoint_pdfs;
ai_parse_document takes the raw binary content of each PDF and applies research-backed techniques to read through the document: handwriting, printed text, tables, figures, mixed layouts. It returns a structured VARIANT output containing the full parsed content, page-level information, and element-level details (text, tables, section headers, figures, captions, bounding boxes, and more). One function call, any document format.
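To get a feel for what the parsed VARIANT contains, you can drill into it with the colon path syntax. The exact field paths below are illustrative and depend on the ai_parse_document output version; check the function reference for your runtime:

```sql
-- Peek at the parsed structure; field paths are illustrative and may
-- differ depending on the ai_parse_document output version.
SELECT
  path,
  parsed_content:document:pages    AS pages,
  parsed_content:document:elements AS elements
FROM main.idp_demo.documents_parsed
LIMIT 5;
```

Browsing a few rows this way helps you understand how your documents were segmented before you write an extraction schema.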
Extract fields with ai_extract:
CREATE OR REFRESH STREAMING TABLE main.idp_demo.documents_extracted
TBLPROPERTIES (
'delta.feature.variantType-preview' = 'supported'
)
AS
SELECT
path,
ai_extract(
parsed_content,
'{
"field_1": {"type": "string", "description": "Description of field 1"},
"field_2": {"type": "string", "description": "Description of field 2"},
"field_3": {"type": "number", "description": "Description of field 3"},
"field_4": {"type": "number", "description": "Description of field 4"}
}'
) AS extracted
FROM main.idp_demo.documents_parsed;
ai_extract takes the VARIANT output from ai_parse_document and applies a JSON schema you define. Replace the placeholder fields above with your actual extraction targets (dates, names, durations, identifiers, or any fields relevant to your documents). The function returns a structured VARIANT with a response object containing the extracted values for each document.
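To spot-check results, or to validate them programmatically, you can pull individual values out of the response object and compare them with labels. The ground-truth table below is hypothetical, shown only to illustrate the comparison pattern:

```sql
-- Spot-check extracted values against a hand-labeled table.
-- main.idp_demo.ground_truth is a hypothetical table keyed by path.
SELECT
  e.path,
  e.extracted:response:field_1::STRING AS predicted_field_1,
  g.field_1                            AS expected_field_1,
  e.extracted:response:field_1::STRING = g.field_1 AS is_match
FROM main.idp_demo.documents_extracted e
JOIN main.idp_demo.ground_truth g
  ON e.path = g.path;
```

Aggregating is_match across documents gives a simple accuracy number you can track while iterating on field descriptions in the schema.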
Quick exploration (chained in a single query):
For ad-hoc exploration before productionizing, you can chain both functions in one query:
WITH parsed AS (
SELECT
path,
ai_parse_document(content, MAP('version', '2.0')) AS parsed_content
FROM main.idp_demo.sharepoint_pdfs
)
SELECT
path,
ai_extract(
parsed_content,
'{
"field_1": {"type": "string", "description": "Description of field 1"},
"field_2": {"type": "string", "description": "Description of field 2"},
"field_3": {"type": "number", "description": "Description of field 3"}
}'
) AS extracted
FROM parsed;
This is useful for testing your schema against a few documents before committing to a full pipeline.
Whether you used Document Intelligence or wrote the SQL yourself, the next step is automation. This is where Spark Declarative Pipelines (SDP) really shine. Instead of manually writing hundreds of lines of procedural logic, you can focus on writing your business logic in SQL and SDP will automatically convert that into a managed production-grade pipeline. SDP handles the heavy lifting with efficient serverless compute and incremental processing by default while providing built-in observability.
If you used the Document Intelligence UI, click Use Agent and select Lakeflow Spark Declarative Pipeline. Databricks auto-generates a complete pipeline with the SQL already written for you. No additional work needed.
If you wrote the SQL yourself, those CREATE OR REFRESH STREAMING TABLE statements from Step 3 are already pipeline-ready. Place them in a Lakeflow Spark Declarative Pipeline, and you have a production pipeline.
Either way, the pipeline contains two STREAMING TABLEs: one that parses each PDF with ai_parse_document (documents_parsed) and one that applies your extraction schema with ai_extract (documents_extracted).
Both tables are incremental. Each time the pipeline runs, it checks for new rows in the ingestion table (new PDFs streamed in from SharePoint) and only processes those. Documents that have already been parsed are not re-parsed. Documents that have already been extracted are not re-extracted. This is what makes the pattern cost-effective at scale: you are not re-running AI models over documents you've already processed.
The result is a gold table with clean, structured data extracted from your PDFs, stored as a standard table in Unity Catalog that plugs into anything downstream.
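One way to materialize that gold table is a third streaming table that casts the VARIANT response into typed columns. This is a sketch using the placeholder field names from Step 3; substitute your real schema and types:

```sql
-- Hypothetical gold table: flatten the VARIANT response into typed columns.
-- Field names mirror the placeholder schema defined earlier.
CREATE OR REFRESH STREAMING TABLE main.idp_demo.documents_gold
AS SELECT
  path,
  extracted:response:field_1::STRING AS field_1,
  extracted:response:field_2::STRING AS field_2,
  extracted:response:field_3::DOUBLE AS field_3,
  extracted:response:field_4::DOUBLE AS field_4
FROM STREAM(main.idp_demo.documents_extracted);
```

Because this table is also a STREAMING TABLE, it inherits the same incremental behavior as the rest of the pipeline.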
The gold table is a governed table with structured data extracted from your unstructured documents. What you build on top of it depends on your use case.
In the embedded video above, we show an AI/BI Dashboard as one example. The dashboard is built on top of the gold table and surfaces visualizations (activity over time, workload distribution, compliance metrics), all powered by data extracted from those original PDFs. Because it's an AI/BI Dashboard, users can ask questions in natural language without needing to know SQL.
The dashboard is just one starting point. The extracted data table is the foundation.
The final step is tying everything together into one automated workflow using Lakeflow Jobs.
Create a job with three tasks chained in sequence: the SharePoint ingestion pipeline, the parse-and-extract Spark Declarative Pipeline, and a refresh of the downstream dashboard.
Each time this job runs, the entire chain executes end-to-end. If no new files have arrived in SharePoint, the ingestion step completes quickly and nothing downstream re-processes. If ten new PDFs landed, only those ten flow through parsing and extraction.
Set a daily trigger, configure failure notifications so the right people know if something breaks, and that's it. Set it and forget it. No custom orchestration code, no fragile scripts to maintain.
Let’s review the full end-to-end flow of the solution, which automates everything from ingestion to final insights: Lakeflow Connect pulls PDFs from SharePoint, ai_parse_document and ai_extract turn them into structured tables inside a Spark Declarative Pipeline, Lakeflow Jobs orchestrates each run, and an AI/BI Dashboard surfaces the results.
No custom ingestion code. No external parsing services. No manual orchestration. The entire pipeline, from PDF to dashboard, runs on Databricks.
To get started, check out the Lakeflow Connect SharePoint documentation, the ai_parse_document reference, and the ai_extract reference. Start with a single document type, prove the pattern, and expand from there.