Across manufacturing, retail, healthcare, and financial services, critical technical data resides in unstructured documents. According to Gartner, around 80% of enterprise data is unstructured. Product manuals, supplier data sheets, and specification catalogs contain the information that teams need for product comparisons, sourcing decisions, and customer support. Yet this data has rarely been systematically extracted into structured databases. And while almost every enterprise values unstructured data, Forbes reports that the majority (about 7 out of 10) still find it difficult to manage.
Manual processing of these documents is too slow, error-prone, and costly at today's scale. As research cited by the European Data Protection Board (EDPB) notes, traditional OCR could recognize text, but structured data extraction typically required rigid templates; any document that deviated from the expected structure yielded lower accuracy. Generative AI changes this equation fundamentally. Large language models understand context, layout, and structure, and they can extract data from documents they have never encountered before, without requiring document-specific templates or training data. For more detailed insights, refer to G. Colakoglu et al. and C. Deng et al.
Consider a concrete scenario: a procurement team needs to compare cordless drill/drivers across four manufacturers. Each vendor provides a PDF manual in a different format. Some span 250+ pages, others are brief instruction sheets with minimal technical detail. Building a comparison spreadsheet from these documents would require a full day of manual effort, and the process would need to be repeated every time a new product enters the catalog.
This is the problem we set out to solve. The same pattern extends well beyond power tools: identifying the right toner for a printer, the compatible component for an assembly line, or the correct accessory for a device. In this article, we walk through every step of building this document extraction solution on Databricks, from parsing and extraction to evaluation and end-user interfaces.
For this article, we used publicly available product manuals from four cordless drill manufacturers as part of a Databricks Solution Accelerator available on GitHub.
We built our solution using Databricks' Intelligent Document Processing capabilities, leveraging two AI Functions to transform raw PDFs into a structured product catalog:
The pipeline is expressed entirely as streaming tables inside a Lakeflow Spark Declarative Pipeline, making it incremental and production-ready from day one, and it leverages Auto Loader for efficient file ingestion.
The architecture is intentionally straightforward: no custom model training and no separate inference endpoints. The entire pipeline can be expressed in Python or SQL, with extraction guided by simple prompting. Below, we describe the three steps of an Intelligent Document Processing pipeline.
Figure 1: Three steps for Intelligent Document Processing
To get started quickly, the Agent Bricks Information Extraction UI (available via the Agents tab) can be used to design the schema interactively and serve as a guide before converting it to code.
Figure 2: Executing the Intelligent Document Processing via the Agent Bricks UI
We start by reading PDFs as binary files from a Unity Catalog Volume, which provides governed file storage. In the UI, a few examples are processed for deeper analysis using the document parsing capability. The figure below shows how the pipeline processes text, tables, and images, generating descriptions for the figures shown, which can be used by downstream AI tools.
Figure 3: Left: Original Product Manual of a Power Tool, Right: Processed document in the UI
In the background, the ai_parse_document function processes each PDF and returns a structured representation of the document, including text, tables, figures, and layout information:
from pyspark import pipelines as dp
import pyspark.sql.functions as F

# I/O variables
table_prefix = spark.conf.get("table")
input_volume_path = f"{spark.conf.get('volume')}/productmanuals"

@dp.table(
    name=f"{table_prefix}_productmanuals_parsed",
    comment="Table containing parsed product manual data from PDF files, including file metadata",
)
def productmanuals_parsed():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "binaryFile")
        .load(input_volume_path)
        .withColumn(
            "parsed",
            F.ai_parse_document(
                F.col("content"),
                {"version": "2.0", "descriptionElementTypes": "*"},
            ),
        )
        .select(
            F.col("_metadata.file_path").alias("path"),
            F.col("_metadata.file_name").alias("file_name"),
            F.col("_metadata.file_size").alias("file_size"),
            F.col("parsed"),
        )
    )
By defining a streaming table with spark.readStream, the pipeline becomes incremental. It applies streaming semantics in a declarative way, meaning only new PDFs landing in the volume are processed on the next pipeline run.
If PDFs originate in external systems such as SharePoint, Lakeflow Connect can ingest files into Unity Catalog, which can then be processed with ai_parse_document.
In the UI, a schema can be generated using simple prompting, with options to add instructions and evaluate the information extraction results ad hoc.
Figure 4: Configuring the Information Extraction in the UI
In the background, the ai_extract function takes the parsed document along with a declarative JSON schema that describes exactly which fields to extract, their data types, and guidance for how to locate them. The schema can also be configured using the UI.
import json

from pyspark import pipelines as dp
from pyspark.sql import functions as F

# I/O variables
table_prefix = spark.conf.get("table")
input_table_path = f"{table_prefix}_productmanuals_parsed"
input_column = "parsed"
output_column = "ai_result"

# Config variables
instructions = """
Extract product specifications from this power tool manual. Focus only on the English language sections.
All extracted values (including product_name) must be in English. Look for technical data in specification
tables, feature lists, and product descriptions throughout the entire document. If a specification is
mentioned anywhere in the document (not just in tables), extract it. If multiple models are listed,
extract the primary model.
"""

schema = json.dumps(
    {
        "manufacturer": {
            "type": "string",
            "description": "Brand or manufacturer name, e.g. Bosch, Makita, BLACK+DECKER",
        },
        "model_number": {
            "type": "string",
            "description": "Product model number or identifier, e.g. GSR 18V-65, BCD382, DF033D",
        },
        "product_name": {
            "type": "string",
            "description": "Full product name or description, e.g. Cordless Drill/Driver, 20V MAX Cordless Drill",
        },
        "product_type": {
            "type": "string",
            "description": "Type of tool: drill, drill/driver, hammer drill, impact driver, etc.",
        },
        "rated_voltage_v": {"type": "number", "description": "Rated or nominal voltage in volts"},
    }
)

@dp.table(
    name=f"{table_prefix}_productmanuals_extract",
    comment="Extracted product specifications (manufacturer, model, voltage, torque, etc.) via AI from parsed product manual PDFs",
)
def productmanuals_extract():
    sql = f"""
        ai_extract(
            {input_column},
            '{schema}',
            map('version', '2.0', 'instructions', '{instructions}')
        )
    """
    return (
        spark.readStream.table(input_table_path)
        .withColumn(output_column, F.expr(sql))
        .select("path", "file_name", "file_size", output_column)
    )
The full schema defines fields covering manufacturer info, performance specifications, physical dimensions, and accessory compatibility. Two design decisions shaped the schema and are worth highlighting for anyone building their own extraction pipeline:
Schema and field descriptions act as extraction hints. Specifying "Use the hard screwdriving value if both hard and soft are given" for torque, or "If given in lbs, convert to kg" for weight, steers the LLM toward the right value with simple prompt engineering. Giving more context by describing torque as "May appear as max torque, tightening torque, or fastening torque" helps the LLM find the right value even when manufacturers use different terminology for the same specification. A short sketch of such hinted fields follows after the next point.
The prompt instructs the AI function. In our case, some manuals contain content in multiple languages. Here, we added explicit instructions to extract only from English sections. We also instructed it to prioritize the specification tables in the documents.
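To illustrate the first of these decisions, extended fields such as torque and weight could carry those hints directly in their descriptions. The field names and wording below are our own examples rather than the accelerator's full schema:

import json

# Hypothetical additional fields showing how the description text doubles as an
# extraction hint; names and wording are illustrative, not the full accelerator schema.
schema_hints = json.dumps(
    {
        "max_torque_nm": {
            "type": "number",
            "description": "Maximum torque in newton meters. May appear as max torque, tightening torque, or fastening torque. Use the hard screwdriving value if both hard and soft are given.",
        },
        "weight_kg": {
            "type": "number",
            "description": "Tool weight in kilograms. If given in lbs, convert to kg.",
        },
    }
)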
The Agent Bricks UI can also serve as a guide when creating the ai_extract integration.
The final step flattens the result from ai_extract into strongly typed columns with descriptive column comments. These comments are important because downstream Databricks components such as Genie Spaces leverage them to generate more accurate results. Those transformations can be generated using Genie Code, an autonomous AI agent in Databricks designed for data engineering, science, and analytics.
from pyspark import pipelines as dp
import pyspark.sql.functions as F

# I/O variables
table_prefix = spark.conf.get("table")
input_table_path = f"{table_prefix}_productmanuals_extract"

# Config variables
schema = """
    file_name STRING COMMENT 'Original PDF file name.',
    manufacturer STRING COMMENT 'Brand or manufacturer name.',
    model_number STRING COMMENT 'Product model number or identifier.',
    product_name STRING COMMENT 'Full product name or description.',
    product_type STRING COMMENT 'Type of tool (drill, drill/driver, hammer drill, etc.).',
    rated_voltage_v DOUBLE COMMENT 'Rated voltage in volts.',
    ...
"""

@dp.table(
    name=f"{table_prefix}_productmanuals_processed",
    comment="Processed product catalog from power tool manuals: structured specifications for cross-vendor comparison, procurement intelligence, and product recommendation.",
    schema=schema,
)
def productmanuals_processed():
    return (
        spark.readStream.table(input_table_path)
        .select(
            F.col("file_name"),
            F.expr("ai_result:response.manufacturer::STRING").alias("manufacturer"),
            F.expr("ai_result:response.model_number::STRING").alias("model_number"),
            F.expr("ai_result:response.product_name::STRING").alias("product_name"),
            F.expr("ai_result:response.product_type::STRING").alias("product_type"),
            F.expr("ai_result:response.rated_voltage_v::DOUBLE").alias("rated_voltage_v"),
        )
    )
All three stages are expressed as streaming tables, making the pipeline incremental end to end. Running the pipeline on the four product manuals produces a structured product catalog ready for analysis:
Figure 5: The extraction results table in the Databricks App, showing structured fields extracted from each PDF.
These results illustrate an important practical consideration: extraction quality correlates directly with the quality of the source document. Professional-grade data sheets yield complete specifications, while consumer instruction manuals sometimes omit detailed technical data. The pipeline handles this appropriately, returning NULL where data is unavailable rather than generating fabricated values.
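One quick way to see this on the processed table is to compute the share of missing values per extracted field. A minimal sketch, reusing the table and column names from the pipeline above:

import pyspark.sql.functions as F

# table_prefix as defined in the pipeline configuration.
processed = spark.read.table(f"{table_prefix}_productmanuals_processed")

# Fraction of documents with a NULL per extracted field.
null_rates = processed.select(
    [
        F.avg(F.col(c).isNull().cast("double")).alias(c)
        for c in ["manufacturer", "model_number", "product_type", "rated_voltage_v"]
    ]
)
null_rates.show()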
In the absence of ground truth labels, how do we measure whether the extraction is performing well? We use MLflow 3 GenAI evaluation with a combination of code-based scorers (checking field completeness and numeric plausibility) and LLM-as-judge scorers (verifying that all values are in English and that extracted manufacturers and model numbers are valid). These four scorers run in a single mlflow.genai.evaluate() call:
results = mlflow.genai.evaluate(
    data=eval_data,
    scorers=[completeness, format_validator, english_check, extraction_quality]
)
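To make the scorer side concrete, here is a minimal sketch of what a code-based completeness scorer could look like using MLflow's scorer decorator. The required field list is an assumption, and the accelerator's actual scorers may differ:

from mlflow.genai.scorers import scorer

# Assumed set of fields every document should yield; adjust to your schema.
REQUIRED_FIELDS = ["manufacturer", "model_number", "product_type", "rated_voltage_v"]

@scorer
def completeness(outputs):
    # Fraction of required fields that were extracted (non-null).
    found = sum(1 for field in REQUIRED_FIELDS if outputs.get(field) is not None)
    return found / len(REQUIRED_FIELDS)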
Figure 6: MLflow evaluation results across four documents.
Results are logged to an MLflow experiment, establishing a continuous feedback loop: refine the prompt, re-run the pipeline, re-run the evaluation, and compare metrics across runs. This follows the same experiment tracking methodology that data scientists use for traditional ML development, now applied to GenAI extraction pipelines.
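Comparing iterations can be as simple as pulling the logged metrics back from the experiment, for example with mlflow.search_runs (the experiment name below is a placeholder):

import mlflow

# Compare evaluation metrics across prompt/schema iterations; experiment name is a placeholder.
runs = mlflow.search_runs(experiment_names=["/Shared/product-manual-extraction"])
print(runs.filter(like="metrics.", axis=1))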
A structured table is only valuable if the right people can access it. We expose the extracted data through two complementary interfaces, each serving a different type of question.
Figure 7: End-to-end architecture, from PDF ingestion through extraction to productized interfaces surfaced in Databricks One.
The pipeline that parses, extracts, and post-processes the documents is orchestrated by Lakeflow Jobs. The entire solution is deployed via Databricks Asset Bundles for CI/CD.
A Genie Space allows business users to ask natural-language questions about the structured product catalog without writing SQL. Questions like "Which drill has the highest max torque?" or "Compare all drills by weight and voltage" are translated into SQL and executed against the processed table. Because we invested in descriptive column comments during the processing step, the Genie Space understands the semantics of each field and generates more accurate queries.
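For a question like "Which drill has the highest max torque?", the generated query would look roughly like the sketch below (shown via spark.sql for readability; the exact SQL Genie produces may differ, and max_torque_nm is an assumed column from the extended schema):

# Roughly the kind of SQL a Genie Space generates for
# "Which drill has the highest max torque?"; max_torque_nm is an assumed column
# from the extended schema, table_prefix as defined in the pipeline configuration.
spark.sql(f"""
    SELECT manufacturer, model_number, product_name, max_torque_nm
    FROM {table_prefix}_productmanuals_processed
    ORDER BY max_torque_nm DESC
    LIMIT 1
""").show()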
Not all questions can be answered from the extracted fields. Safety instructions, maintenance procedures, troubleshooting steps, and warranty terms live in the original documents. A Knowledge Assistant built with Agent Bricks indexes the raw PDF manuals from the same Unity Catalog Volume and provides cited, document-grounded answers to these open-ended questions. The incremental refresh is also executed by Lakeflow Jobs, leveraging the Agent Bricks REST API.
A Supervisor Agent ties the Genie Space and Knowledge Assistant into one interface, allowing users to ask questions about structured and unstructured data. It can be surfaced to end users using Databricks Apps as the user interface.
Databricks Apps enables developers to build and deploy secure data and AI applications directly on the Databricks platform, which eliminates the need for separate infrastructure. Apps run on the serverless platform and integrate with key platform services, including Unity Catalog for data governance, Databricks SQL for querying data, and OAuth for authentication.
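Inside an app, queries against the processed table typically go through a SQL warehouse. A minimal sketch with the databricks-sql-connector follows; hostname, warehouse path, token, and table name are placeholders, and a deployed app would normally rely on its OAuth identity rather than a personal access token:

import os
from databricks import sql

# Placeholders: a deployed app resolves these from its environment / OAuth identity.
with sql.connect(
    server_hostname=os.environ["DATABRICKS_HOST"],
    http_path=os.environ["DATABRICKS_WAREHOUSE_HTTP_PATH"],
    access_token=os.environ["DATABRICKS_TOKEN"],
) as connection:
    with connection.cursor() as cursor:
        cursor.execute(
            "SELECT manufacturer, model_number, rated_voltage_v "
            "FROM main.default.productmanuals_processed"  # placeholder table name
        )
        for row in cursor.fetchall():
            print(row)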
Figure 8: Databricks Apps provides a self-service interface for uploading PDFs and triggering the extraction pipeline.
Figure 9: The Supervisor Agent responds to "Compare all drills by weight and voltage" with a structured table drawn from the extracted data.
Databricks One is a user interface designed for business users, giving them a single, intuitive entry point to interact with data and AI in Databricks. From there, users can ask data questions in natural language via Genie Spaces, interact with custom-built Databricks Apps that combine analytics, AI, and workflows, and view AI/BI dashboards to track KPIs and analyze metrics.
Figure 10: Databricks One surfaces Genie Spaces, Apps, and dashboards as a single entry point for business users.
The architecture we presented is not limited to product manuals. The same pattern applies to any scenario where structured data needs to be extracted from unstructured documents:
For organizations looking to extend this solution, here are natural next steps:
Explore the Databricks Solution Accelerator for this solution on GitHub, or speak with your Databricks representative to start building your own AI-powered Intelligent Document Processing pipeline today.